EXPRESS MAIL NO.: EV475140784US

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE
BEFORE THE BOARD OF APPEALS AND INTERFERENCES

Application of: Nehls et al.
Confirmation No.: 9822
Serial No.: 09/398,253
Art Unit: 1637
Filed: September 17, 1999
Examiner: Young J. Kim
For: NOVEL HUMAN POLYNUCLEOTIDES AND POLYPEPTIDES ENCODED THEREBY
Docket No.: 8535-026-999

APPELLANT'S BRIEF ON APPEAL UNDER 37 C.F.R. §§ 1.191 AND 1.192

Pursuant to the provisions of 37 C.F.R. §§ 1.191 and 1.192, an appeal is taken
herein from the final rejection of claims 1, 3, 4, 10, and 12 of this application. Appellant
submits an original and two copies of this appeal brief accompanied by: (1) a Petition for
Extension of Time (in duplicate) for five months from February 29, 2004 up to and
including July 29, 2004, accompanied by the appropriate fee; and (2) a Brief on Appeal Fee
Transmittal Sheet (in duplicate). Appellant also submits herewith Exhibit A: an appendix
of the claims (i.e., claims 1, 3, 4, 10, and 12) under appeal; Exhibit B: sequence alignments
of SEQ ID NOS:9-18 with human genomic sequences in GenBank; and Exhibit C: Venter
et al., 2001, Science 291:1304.



I. REAL PARTY IN INTEREST

The inventors have assigned the entire right and interest in the instant
application to Lexicon Genetics Incorporated, 8800 Technology Forest Place, The
Woodlands, Texas, 77381.

II. RELATED APPEALS AND INTERFERENCES

Appellant is not aware of any other appeals or interferences which will
directly affect, or be directly affected by, or have a bearing on the Board's decision in the
present appeal.

III. STATUS OF CLAIMS

Original claims 1-4 of this application were elected for prosecution and
non-elected claims 5-9 were withdrawn from consideration by the Examiner. Claim 2 was
canceled without prejudice; claims 1, 3, and 4 were amended; and new claims 10 and 11¹
were added in an Amendment filed on April 11, 2001. Claims 1, 3, 4, and 10-11 were
finally rejected in an Office Action dated July 3, 2001. Claims 1 and 11 were further
amended in an Amendment filed on January 3, 2002. A Notice of Appeal was filed on
January 3, 2002 appealing the rejection of claims 1, 3, 4, and 10-11. An Appellant's Brief
on Appeal was filed August 5, 2002, the contents of which are incorporated by reference in
their entirety. The Examiner reopened prosecution in an Office Action dated November 5, 2002.
Claim 11 was canceled, claims 1, 3, and 10 were further amended, and new claim 12 was
added in an Amendment filed on April 7, 2003. Claims 1, 3, 4, 10, and 12 were finally
rejected in an Office Action dated July 2, 2003. A Notice of Appeal was filed on December
31, 2003 appealing the rejection of claims 1, 3, 4, 10, and 12.

IV. STATUS OF AMENDMENTS 

Subsequent to the final rejection in the Office Action dated July 2, 2003,
Appellant submitted a Response under Rule 116 dated December 31, 2003 requesting
reconsideration in an attempt to secure allowance of the claims. The request for reconsideration
was considered but did not place the application in condition for allowance, as indicated
in the Advisory Action from the Examiner mailed January 30, 2004. This Appellant's Brief
on Appeal is directed to claims 1, 3, 4, 10, and 12. A copy of the claims involved in this
Appeal is presented in the attached Exhibit A.

V. SUMMARY OF THE INVENTION 

The present invention, as described and claimed, relates to oligonucleotides 
and polynucleotides that are disclosed as SEQ ID NOS: 9-18 in the Sequence Listing. These 
oligonucleotides and polynucleotides were discovered using gene trap technology in human
teratocarcinoma cells. 

According to the invention, gene trap vectors are used to integrate into
intron sequences of cellular genes ("the trapped genes") in a genome, and two
fusion transcripts are produced as a result. See page 4, lines 19-24; page 75, lines 1-30; and
Figures 1A to 1C of the specification. The first fusion transcript comprises the coding
region of a selectable marker (neomycin resistance was used to produce the presently
described oligonucleotides and polynucleotides) carried within the vector and the upstream
exon(s) from the interrupted cellular gene. A mature transcript is generated when the splice
donor (SD) and splice acceptor (SA) sites as shown in Figure 1C are spliced together.
Translation of this transcript produces a fusion protein that allows for the selection of cells
comprising an integrated gene trap vector.

¹ New claims 5 and 6 filed on July 3, 2001 were renumbered as new claims 10 and 11 by the Examiner.

The second fusion transcript comprises exon 1 of the murine btk gene within 
the vector which is fused with exons of the trapped gene that are located downstream of the 
integration site. Unlike the first fusion transcript, transcription of this transcript is under the 
control of a vector-borne promoter (such as the PGK promoter), and the corresponding 
mRNA is generated by splicing between the splice donor (SD) and splice acceptor (SA) sites 
as shown in Figure 1B. To facilitate isolation of the trapped genes, cDNA was generated by
reverse transcribing isolated RNA from pools of human teratocarcinoma cells that have
undergone independent gene trap events. Based on the unique sequences present in the first
exon of the murine btk gene, selective cloning of the fusion transcript is achieved as shown
in Figure 1D and as described on page 76, line 1 to page 77, line 2 of the specification.

Teratocarcinoma cells are the "stem cells" that occur in unusual germ cell 
tumors and represent a good model for molecular mechanisms of embryonic development 
and differentiation. These cells can generate almost any kind of tissue, such as teeth, hair, bone,
muscle, and cartilage. Stem cells possess the ability both to produce identical daughter cells 
(self-renewal), and to produce progeny with more restricted fates (commitment and 
differentiation). This property of stem cells underpins growth and diversification during 
development and sustains homeostasis and repair processes throughout adult life. An 
understanding of molecular mechanisms which govern stem cell fate is therefore of 
fundamental significance in cell and developmental biology and the capabilities arising from 
such knowledge have major biomedical applications. 

Example 6.1 (pages 74-80; Figures 1A-1D) demonstrated the identification of
oligonucleotides and polynucleotides from human teratocarcinoma cells comprising the 
claimed nucleic acid sequences of SEQ ID NOS:9-18. 

Appellant submits that the gene trap method enriches for a class of genes that are 
not required for teratocarcinoma cell viability and are likely to be involved in late stages of 
cellular differentiation and development. As such, the claimed polynucleotides and 
oligonucleotides are preselected and do not belong to the broad class of random DNA in the 
genome. Rather, they belong to a subset within the broad class of genes. 

VI. ISSUES



The following issues are presented for review in this appeal: 

A. UTILITY 

(1) Whether claims 1, 3, 4, 10, and 12 lack patentable utility under 35 U.S.C. 
§101 for the lack of a specific, substantial, and credible utility. In the Office Actions dated 
November 5, 2002, July 2, 2003, and an advisory action dated January 30, 2004, the 
Examiner contended: 

(a) that claims 1, 3, 4, 10, and 12 are not supported by a specific asserted 
utility because the disclosed uses of the nucleic acids are not specific 
and are generally applicable to any nucleic acid; 

(b) that the claimed invention is not supported by a substantial utility
because no substantial utility has been established for the claimed
subject matter; and

(c) that, since the claimed invention is not supported by a specific and
substantial asserted utility, credibility has not been assessed.

As discussed below, the Examiner's contentions are in error, and the rejection 
should be reversed. 

(2) Whether claims 1, 3, 4, 10, and 12 lack patentable utility under 35 U.S.C.
§ 112, first paragraph. In the Office Actions dated November 5, 2002, July 2, 2003, and the
Advisory Action dated January 30, 2004, the Examiner contended that since claims 1, 3, 4, 
10, and 12 are not supported by either a specific or substantial utility or a well established 
utility, one skilled in the art would not know how to use the claimed invention. 

As discussed below, the Examiner's contention is in error, and the rejection 
should be reversed. 

B. WRITTEN DESCRIPTION 

Whether claims 1, 3, 4, 10, and 12 contain subject matter that was not
described in the specification in such a way as to reasonably convey to one skilled in the
relevant art that the inventors, at the time the application was filed, had possession of the
claimed invention under 35 U.S.C. § 112, first paragraph. In the Office Actions dated
November 5, 2002, July 2, 2003, and the Advisory Action dated January 30, 2004, the
Examiner contended that while the specification discloses SEQ ID NOS: 9-18, the
specification provides insufficient written description to support the genus of nucleotide
sequences that comprise SEQ ID NOS: 9-18 which are encompassed by claims 1, 3, 4, 10,
and 12.

As discussed below, the Examiner's contention is in error, and the rejection
should be reversed. 

VII. GROUPING OF CLAIMS

A. UTILITY UNDER 35 U.S.C. § 101 

Claims 1, 3, 4, 10, and 12 stand rejected under 35 U.S.C. § 101 for the lack of 
utility. Appellant believes that with regard to the issue of utility under 35 U.S.C. § 101,
claims 1, 3, 4, 10, and 12 stand or fall together. 

B. UTILITY UNDER 35 U.S.C. § 112 

Claims 1, 3, 4, 10, and 12 stand rejected under 35 U.S.C. § 112, first
paragraph, for lack of utility. Appellant believes that with regard to the issue of utility under
35 U.S.C. § 112, first paragraph, claims 1, 3, 4, 10, and 12 stand or fall together.

C. WRITTEN DESCRIPTION 

Claims 1, 3, 4, 10, and 12 stand rejected under 35 U.S.C. § 112, first
paragraph, for lack of written description. Appellant believes that with regard to the issue of
written description under 35 U.S.C. § 112, first paragraph, claims 1, 3, 4, 10, and 12 stand or
fall together.

VIII. ARGUMENTS 

A. UTILITY OF THE REJECTED CLAIMS 

Claims 1, 3, 4, 10, and 12 are drawn to oligonucleotides or polynucleotides
that comprise the nucleotide sequences of SEQ ID NOS: 9-18. These claims have been 
rejected under 35 U.S.C. § 101. 

According to 35 U.S.C. § 101, whoever invents or discovers any new and 
useful process, machine, manufacture, or composition of matter may obtain a patent therefor 
subject to the conditions and requirements of 35 U.S.C. The threshold of utility is not high. 
Juicy Whip Inc. v. Orange Bang Inc., 51 USPQ2d 1700, 1702 (Fed. Cir. 1999). An 
invention is "useful" under 35 U.S.C. § 101 if it is capable of providing some identifiable
benefit. Id. (citing Brenner v. Manson, 383 U.S. 519, 534, 148 USPQ 689, 695 (1966)).
Additionally, the Federal Circuit has stated that "[t]o violate § 101 the claimed device must
be totally incapable of achieving a useful result." Brooktree Corp. v. Advanced Micro
Devices, Inc., 977 F.2d 1555, 1571, 24 USPQ2d 1401 (Fed. Cir. 1992), emphasis added.
Cross v. Iizuka (753 F.2d 1040, 224 USPQ 739 (Fed. Cir. 1985); "Cross") states "any
utility of the claimed compounds is sufficient to satisfy 35 U.S.C. § 101". Cross at 748,
emphasis added. Indeed, the Federal Circuit recently emphatically confirmed that "anything
under the sun that is made by man" is patentable (State Street Bank & Trust Co. v.
Signature Financial Group Inc., 149 F.3d 1368, 47 USPQ2d 1596, 1600 (Fed. Cir. 1998),
citing the U.S. Supreme Court's decision in Diamond v. Chakrabarty, 447 U.S. 303, 206
USPQ 193 (U.S., 1980)).

It has been clearly established that a statement of utility in a specification 
must be accepted absent reasons why one skilled in the art would have reason to doubt the 
objective truth of such statement. In re Langer, 503 F.2d 1380, 1391, 183 USPQ 288, 297
(CCPA, 1974); In re Marzocchi, 439 F.2d 220, 224, 169 USPQ 367, 370 (CCPA, 1971).
The specification provides numerous specific, substantial, and credible utilities for the 
claimed nucleic acids comprising SEQ ID NOS:9-18. 

1. THE REJECTED CLAIMS HAVE SPECIFIC UTILITY 

The Examiner has based the rejection of claims 1, 3, 4, 10, and 12 on the
contention that the disclosed uses of the nucleic acids are not specific and are generally
applicable to any nucleic acid. Appellant has presented arguments in amendments filed
April 7, 2003 and December 31, 2003 regarding the fact that the rejected claims have
specific utility. However, the Examiner has not provided any counterarguments refuting
Appellant's position. Instead, in the Advisory Action dated January 30, 2004, the Examiner
contends that the invention does not have a specific utility because the claimed nucleic acids and
oligonucleotides lack a substantial, immediately apparent utility. Specifically, the Advisory
Action states on page 2:

"On page 2, Applicants argue that the claimed oligonculetides and 
polynucleotides do not have a general utility but a specific utility because the 
gene trap method, "enriches for a class of genes that are not required for 
tertocarcinoma cell viability and are likely to be involved in late stages of 
cellular differentiation and development (page 2, Response). As Office 
Action mailed on July 2, 2003 already addressed, the claimed nucleic acids 
and oligonucleotides lacks a substantial, immediate apparent, utility." 

Appellant submits that the Examiner did not address whether the arguments 
presented by the Appellant regarding specific utility are persuasive. This is improper, since 
the requirement for specific utility is separate from that of substantial utility. Below is a 
summary of arguments presented in responses filed April 7, 2003 and December 31, 2003 
regarding specific utility of the present invention. 

Regarding specific utility, the Revised Interim Utility Guidelines Training 
Materials define it as follows: 

"Specific utility is a utility that is specific to the subject matter
claimed. This contrasts with a general utility that would be applicable to the
broad class of the invention. For example, a claim to a polynucleotide whose
use is disclosed simply as a "gene probe" or "chromosome marker" would
not be considered to be specific in the absence of a disclosure of a specific
DNA target. Similarly, a general statement of diagnostic utility, such as
diagnosing an unspecified disease, would ordinarily be insufficient absent a
disclosure of what condition can be diagnosed."

(http://www.uspto.gov/web/menu/utility).

Unlike the example cited in the above definition where any fragment of 
genomic DNA can in theory be used as a probe or a chromosome marker, the polynucleotide 
sequences of SEQ ID NOS: 9-18 have utilities that are not common to any random gene in 
the genome. Appellant submits that the polynucleotide sequences of SEQ ID NOS: 9-18 
have specific utilities which stem from their cellular origin and the identification process. 
As explained in the Summary of The Invention hereinabove, gene trap vectors were 
introduced into human teratocarcinoma cells which led to the identification of gene loci that 
comprise the sequences set forth in SEQ ID NOS: 9-18. In particular, as the gene trap
vectors were introduced into the human teratocarcinoma cells, they integrated into the cells'
genomes, resulting in gene fusions. Each fusion produces a transcript that comprises one or
more exons that are located either upstream or downstream from the integration site. These 
exons, which are portions of a genetic locus that was disrupted by a gene trap vector, are 
represented by the presently claimed oligonucleotides and polynucleotides. 

Appellant emphasizes that the sequences set forth in SEQ ID NOS: 9-18 were
not picked from the human genome randomly; rather, they represent a selection of genetic
sequences that play a role in the later stages of cellular differentiation and development.
Genes that are critically essential to the survival of teratocarcinoma cells would not have
been isolated and propagated by the gene trap methods of the invention, as cells bearing
disruptions in such a class of genes would not have been able to survive after transfection
with the gene trap vector. Accordingly, the utility of these sequences is not general
because not every gene in the genome plays a role in the later stages of cellular differentiation
and development as provided by the oligonucleotides and polynucleotides of the invention.

Appellant respectfully points out that the genetic loci in the teratocarcinoma 
cells which have been identified by the gene trap vectors fall within a specific class of genes 
which are distinct from the broad general class of genes in the genome. Apparently, these 
identified genetic loci encode genetic functions, both copies of which are not critically 
essential to the survival and growth of teratocarcinoma cells. After transfection with the 
gene trap vectors, the teratocarcinoma cells survived and propagated in culture with only one 
fully functional allele of the genetic loci. Thus, these genetic loci and the products encoded
by these loci are preselected by the transfection and the ensuing cell culture process for 
possessing functions involved in later stages of cellular differentiation and development. 

Appellant asserts that the identified sequences represent a specific class of
genes that is involved in later stages of cellular differentiation and development. These
genetic loci encode genetic functions that are not inhibitors of cell death or apoptosis and are
not involved in the general survival, i.e., house-keeping functions, of teratocarcinoma cells,
because the loss of one functional allele of these genes does not trigger cell death or apoptosis
and one functional allele of these genes is sufficient for cell survival and growth. The usefulness
of such genes is well-established in the art and is described in the originally filed
specification, inter alia, at page 12, lines 16-24. Support regarding gene function of the
presently claimed oligonucleotides and polynucleotides can be derived logically as
explained below.

The insertion of a gene trapping vector into a gene will interrupt the proper 
function of one out of two alleles of the gene. If the gene is an inhibitor of cell death or 
apoptosis and both alleles are required for normal function, the cell will die and be lost in a 
population, and the gene will not be identified by the present invention. If the gene is 
required for cell viability, this reduction of gene activity by 50% will in most cases result in 
a decrease in cell viability. Thus, in a population of cells exposed to the gene trapping 
vectors of the invention, the percentage of cells that can be identified as suffering from a 
50% reduction in gene activity of a gene required for cell viability is disproportionately lower
than the percentage of cells that have a 50% reduction in gene activity of a gene not required
for cell viability. On the other hand, in the same population, the percentage of cells with an
insertion of the gene trap vector in a gene that is not required for cell viability will be higher
than the percentage of cells that have a 50% reduction in gene activity of a gene in the
genome that is required for cell viability. As the sequences of the invention are derived
from the cells with insertions of the gene trap vector, the number of identified genes that are 
not required for cell viability will be higher compared to the number of identified genes that 
are required for cell viability. The gene-trapping method of the present invention therefore 
pre-selects a class of genes that is not involved in cell viability. Genes that are not involved 
in cell viability are likely to be involved in later stages of cellular differentiation and 
development. Thus, the gene trap method enriches a class of genes that is involved in later 
stages of cellular differentiation and development. 
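
The enrichment reasoning above can be illustrated with a minimal sketch. The function name and both parameters (the share of viability genes among trappable genes, and the relative survival of cells carrying a trapped viability gene) are hypothetical and are not taken from the specification; the numbers in the usage lines are placeholders for illustration, not data.

```python
def nonessential_fraction(essential_share, relative_survival):
    """Expected share of recovered clones whose trapped gene is NOT required for viability.

    essential_share:   hypothetical fraction of trappable genes required for viability (0..1)
    relative_survival: hypothetical survival of a cell with one trapped allele of a
                       viability gene, relative to a cell with a trapped gene that is
                       not required for viability (0..1)
    """
    if not 0.0 <= essential_share <= 1.0:
        raise ValueError("essential_share must lie in [0, 1]")
    if not 0.0 <= relative_survival <= 1.0:
        raise ValueError("relative_survival must lie in [0, 1]")
    surviving_essential = essential_share * relative_survival
    surviving_nonessential = 1.0 - essential_share
    total = surviving_essential + surviving_nonessential
    if total == 0.0:
        raise ValueError("no clones survive under these assumptions")
    return surviving_nonessential / total


# Placeholder values only: with no survival penalty the recovered clones mirror the
# starting population; with a penalty they are enriched for non-required genes.
unselected = nonessential_fraction(0.2, 1.0)
selected = nonessential_fraction(0.2, 0.5)
```

Under these assumptions, any survival penalty below 1 raises the share of non-required genes among recovered clones above their share in the starting population, which is the sense in which the method "pre-selects" that class.
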

When a polynucleotide or oligonucleotide does not do a certain thing that is
done by many genes within the genome, that fact takes the polynucleotide or oligonucleotide out of
the broad class of random DNA in the genome. Accordingly, the claimed polynucleotides
and oligonucleotides do not belong to the broad class of random DNA in the genome but
belong to a subset within the broad class of genes. Thus, the polynucleotides and
oligonucleotides do not have a general utility, but a specific utility.

As discussed in the Amendment filed on April 7, 2003, Appellant submits that 
the gene trap method enriches for a class of genes that are not required for teratocarcinoma cell 
viability and are likely to be involved in late stages of cellular differentiation and development. 
Appellant submits that according to the Guidelines for the Utility Requirement ("Utility
Guidelines"), 66 FR 1098 (Jan. 5, 2001); MPEP 2107.01, a claim to a polynucleotide whose use is
disclosed simply as a "gene probe" or "chromosome marker" would not be considered to be 
specific in the absence of a disclosure of a specific DNA target. According to the Utility 
Guidelines, a "specific utility" is specific to the subject matter claimed. This contrasts with a 
general utility that would be applicable to the broad class of the invention. The Utility 
Guidelines indicate that since any gene can be used as a "gene probe" or "chromosome marker", 
there is a lack of specific utility if there is no specific DNA target. Accordingly, any gene or 
fragment of DNA sequence that is present in the human genome would fall within this broad 
class of the invention. However, the claimed polynucleotides of the present invention can be 
used as a gene probe or chromosome marker specific for such genes that are of particular interest 
to scientists and medical practitioners studying the biology of cellular differentiation and 
development. While the asserted utility is not as narrowly defined as that of a correlation with a
disease condition, and although the number of polynucleotides that have such a specific utility is
larger than that of polynucleotides associated with a Mendelian genetic disease,
Appellant submits that it is nevertheless not a general utility that would be applicable to the
broad class of genes in the genome. As such, Appellant submits that the claimed invention
meets the threshold requirement of having specific utility.

The Examiner contends in the Office Action dated July 2, 2003 that neither the
specification nor the response discloses any associated phenotypes for the claimed
polynucleotides. The Examiner further contends in the Office Action that the specification does
not disclose what the functions of the claimed polynucleotides are. Appellant submits that
the lack of an assertion of utility does not mean the invention has no utility. According to MPEP 2107,
if the applicant has not asserted any specific and substantial utility for the claimed invention,
rejections under 35 U.S.C. §§ 101 and 112 shift the burden of coming forward with evidence to
the applicant to: (i) explicitly identify a specific and substantial utility for the claimed invention;
and (ii) provide evidence that one of ordinary skill in the art would have recognized that the
identified specific and substantial utility was well-established at the time of filing. Specific and
substantial utility for the claimed invention and evidence that one of ordinary skill in the art
would have recognized that the identified specific and substantial utility was well-established at
the time of filing were presented in the previously filed responses dated April 7, 2003 and December
31, 2003, and are presented in the present Appeal Brief.

The Examiner alleges in the Office Action dated November 5, 2002 that
Appellant has not disclosed what roles SEQ ID NOS:9-18 play in the later stages of
cellular differentiation and development. Appellant submits that it is not necessary to
disclose what such roles are in order to satisfy the specific utility requirement. Appellant
submits that since the claimed oligonucleotides or polynucleotides of the present invention are
not just any piece of nucleic acid, the specific utility requirement has been satisfied. As clearly
set forth by the Federal Circuit in Carl Zeiss Stiftung v. Renishaw PLC, 20 USPQ2d 1101
(Fed. Cir. 1991):

An invention need not be the best or only way to accomplish a certain 
result, and it need only be useful to some extent and in certain applications: 
"[T]he fact that an invention has only limited utility and is only operable in
certain applications is not grounds for finding a lack of utility." Envirotech
Corp. v. Al George, Inc., 221 USPQ 473, 480 (Fed. Cir. 1984).

The fact that Appellant has not disclosed what precise biological roles
the presently claimed polynucleotides or oligonucleotides have in the later stages of cellular
differentiation and development does not mean that the presently described polynucleotides
or oligonucleotides lack utility.

The present invention represents genes that need not have an easily 
observable phenotype. By conventional forward genetics, the cells are mutated and selected 
for an observable phenotype. Subsequently, the mutation is genetically mapped by 
following the phenotype. Based on the genetic map position, the gene is cloned. Without an 
observable phenotype, the mutation cannot be genetically mapped and the associated gene 
cannot be cloned. The gene trap method, in contrast, pre-selects for a class of genes that is 
not required for cell viability, and thus effectively narrows the scope of the identification 
process. 

Appellant further submits that the claimed oligonucleotides and 
polynucleotides are specifically identified and functionally validated exons (i.e., exons
which had been actually spliced during post-transcriptional processing) that would not have
been identified by conventional molecular biology approaches. Exhibit B shows the
sequence alignments of SEQ ID NOS:9-18 with human genomic sequences in GenBank. As
set forth in the specification, inter alia, at page 12, line 12, the present invention provides
tools for identifying exon splice junctions, chromosome mapping, etc. This is precisely the
utility of the present invention as set forth throughout the specification as originally filed. 
The specification, inter alia, at page 20, lines 12-15, describes that the claimed 
oligonucleotides or polynucleotides from the intron/exon boundaries of the human gene can 
be used to design primers for use in amplification assays to detect mutations within the 
exons, introns, splice sites (e.g., splice acceptor and/or donor sites) that can be used in 
diagnostics. For example, as shown in Exhibit B, Appellant submits that SEQ ID NO: 16
defines a coding region since SEQ ID NO: 16 spans four distinct exons on chromosome 22 
(bases 90259 to 90085; bases 90631 to 90504; bases 89080 to 89014; and bases 94134 to 
94085 from Genbank accession number AL021391 which is a clone from chromosome 22) 
that are separated by introns (bases 89081 to 90084; bases 90260 to 90503; and bases 90632 
to 94084 from Genbank accession number AL021391). 
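
The relationship between the exon and intron coordinates recited above can be checked with a short sketch. The coordinates are those stated in this brief for accession number AL021391; the function and variable names are hypothetical, and the sketch assumes that the base ranges are inclusive and that an exon listed high-to-low simply lies on the reverse strand.

```python
def introns_between(exons):
    """Return the inclusive (start, end) gaps lying between consecutive exons.

    exons: list of (a, b) inclusive base coordinates in either orientation.
    """
    # Normalize each exon to (low, high) and order the exons along the clone.
    normalized = sorted((min(a, b), max(a, b)) for a, b in exons)
    introns = []
    for (_, left_end), (right_start, _) in zip(normalized, normalized[1:]):
        if right_start <= left_end:
            raise ValueError("exons overlap")
        if right_start - left_end > 1:  # adjacent exons leave no gap
            introns.append((left_end + 1, right_start - 1))
    return introns


# Exon coordinates as recited in the brief for SEQ ID NO:16.
SEQ16_EXONS = [(90259, 90085), (90631, 90504), (89080, 89014), (94134, 94085)]

print(introns_between(SEQ16_EXONS))
# prints [(89081, 90084), (90260, 90503), (90632, 94084)]
```

The three gaps computed from the four exon ranges are the same three intron ranges recited in the text.
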

Appellant points out that only a small percentage (2-4%) of the human 
genome actually encodes exon sequences, and these exons are widely interspersed within a 
given chromosome. When the gene comprising these exons is expressed, the cell must clip
out these exons and assemble them end-to-end in order to produce a functional mRNA
which acts as a template for the translation of a protein product. The claimed
oligonucleotides or polynucleotides comprising the sequence of SEQ ID NOS:9-18 encode
exons that are actually spliced together to produce an active functional transcript (i.e., one of
the utilities of the described sequences is for defining intron/exon splice-junctions). Exon
splice junctions are particularly important in the study of disease and cancer because splice 
junctions can often be hot spots for erroneous events leading to these disease states. 
Appellant respectfully submits that the practical scientific value of biologically validated, 
expressed, spliced, and polyadenylated mRNA sequences is readily apparent to those skilled 
in the relevant biological and biochemical arts. 

For further evidence in support of the Appellant's position, section 3 of
Venter et al., 2001, Science 291:1304 (Exhibit C), particularly Fig. 11 at pp. 1324-1325,
demonstrates the significance of expressed sequence information in the structural analysis of
genomic data. The present polynucleotide sequence defines a biologically validated
sequence that provides a unique and specific resource for mapping the genome essentially as
described in the Venter et al. article. In disclosing a functionally validated exon splice
junction, the claimed oligonucleotides or polynucleotides provide physical evidence that 
effectively trumps the hypothetical conclusions provided by bioinformatics analysis of the 
corresponding genomic region conducted without supporting physical data. Thus, the 
present sequence clearly meets the requirements of 35 U.S.C. § 101. 

Furthermore, the gene trapped sequences of the present invention overcome 
some of the limitations of conventional cDNA and expressed sequence tag libraries. In 
particular, the claimed oligonucleotides or polynucleotides were identified using gene trap 
vectors that are independent of the level of endogenous mRNA expression of a gene for 
identification of that gene. The gene trap vectors are able to trap poorly expressed genes. 

Still further, Appellant points out that each of the sequences of the present
invention can be used to map a specific region on a specific human chromosome. The
specificity of each of the claimed oligonucleotides or polynucleotides is listed below: SEQ
ID NO:9 can be used to map a specific region of human chromosome 5, due to the fact that
SEQ ID NO:9 aligns with two clones from chromosome 5 (Genbank accession numbers
AC012640 and AC034241); SEQ ID NO:10 can be used to map a specific region of human
chromosome 12, due to the fact that SEQ ID NO:10 aligns with a clone from chromosome
12 (Genbank accession number AC140062); SEQ ID NO:11 can be used to map a specific
region of human chromosome 4, due to the fact that SEQ ID NO:11 aligns with a clone
from chromosome 4 (Genbank accession number AC112518); SEQ ID NO:12 can be used
to map a specific region of human chromosome 9, due to the fact that SEQ ID NO:12 aligns
with a clone from chromosome 9 (Genbank accession number AL158207); SEQ ID NO:13
can be used to map a specific region of human chromosome 10, due to the fact that SEQ ID
NO:13 aligns with a clone from chromosome 10 (Genbank accession number AL161936);
SEQ ID NO:14 can be used to map a specific region of human chromosome 11, due to the
fact that SEQ ID NO:14 aligns with a clone from chromosome 11 (Genbank accession
number AC092768); SEQ ID NO:15 can be used to map a specific region of human
chromosome 12, due to the fact that SEQ ID NO:15 aligns with a clone from chromosome
12 (Genbank accession number AC008115); SEQ ID NO:16 can be used to map a specific
region of human chromosome 22, due to the fact that SEQ ID NO:16 aligns with a clone
from chromosome 22 (Genbank accession number AL021391); SEQ ID NO:17 can be used
to map a specific region of human chromosome 18, due to the fact that SEQ ID NO:17
aligns with a clone from chromosome 18 (Genbank accession number AC015933); SEQ ID
NO:18 can be used to map a specific region of human chromosome 1, due to the fact that
SEQ ID NO:18 aligns with a clone from chromosome 1 (Genbank accession number
AL360270). Exhibit B shows the sequence alignments of SEQ ID NOS:9-18 with Genbank
human genomic sequences. The presently claimed oligonucleotides or polynucleotides have
specific utility in mapping the protein encoding regions of the corresponding human
chromosome, as described in the specification, inter alia, at page 12, line 12. The exquisite
specificity of each of the claimed oligonucleotides or polynucleotides for its specific locus
on a corresponding human chromosome is evidenced by the fact that each of the claimed
oligonucleotides or polynucleotides does not specifically align with any other human genomic
sequences. Hence, the claimed polynucleotides are not random fragments of genomic DNA
of unknown location. Thus, the present sequence clearly meets the utility requirements of
35 U.S.C. § 101.

While earlier mapping techniques have identified gross chromosomal 
positions for numerous disease-associated genes, these techniques are inadequate to 
precisely map these genes. However, using the presently described nucleotide sequence and 
a computer system, the exact location of such disease-associated genes can be 
specifically pinpointed, as detailed above. The claimed oligonucleotides or polynucleotides 
provide exquisite specificity in localizing the specific region of a particular human 
chromosome that contains the gene encoding the given polynucleotide, a utility not shared 
by virtually any other nucleic acid sequences. In fact, it is this specificity that makes this 
particular sequence so useful. Early gene mapping techniques relied on methods such as 
Giemsa staining to identify regions of chromosomes. However, such techniques produced 
genetic maps with a resolution of only 5 to 10 megabases, far too low to be of much help in 
identifying specific genes involved in disease. The skilled artisan readily appreciates the 
significant benefit afforded by markers that map a specific locus of the human genome, such 
as the present oligonucleotides or polynucleotides. 

As the present oligonucleotides or polynucleotides are specific markers of the 
human genome, and such specific markers are targets for the discovery of drugs that are 
associated with human disease, those of skill in the art would instantly recognize that the 
present nucleotide sequence would be an ideal, novel candidate for assessing gene 
expression. Thus, the present claims clearly meet the requirements of 35 U.S.C. § 101. 

Since the Appellant has asserted specific and substantial utility for the 
claimed invention, inter alia, on page 12, lines 8 to 24 of the specification, the Examiner is 
required to establish a prima facie case for lack of specific and substantial utility. The 
Guidelines for the Utility Requirement provide that where the asserted utility appears not to 
be specific or substantial, a prima facie showing must establish that it is more likely than not 
that a person of ordinary skill in the art would not consider that any utility asserted by the 
applicant would be specific and substantial. The prima facie showing must contain the 
following elements (see MPEP 2107(II)(C)(1) and 2107.02(IV)) : (1) an explanation that 
clearly sets forth the reasoning used in concluding that the asserted utility for the claimed 
invention is not both specific and substantial nor well-established; (2) support for factual 
findings relied upon in reaching this conclusion; and (3) an evaluation of all relevant 
evidence of record, including utilities taught in the closest prior art. The Examiner has not 
provided any factual findings on which the conclusion of lack of specific and substantial 
utility relies, nor has the Examiner evaluated utilities taught in the closest prior art. 
Accordingly, the Examiner has not provided a prima facie showing that the invention does 
not have specific and substantial utility. The rejection is thus in error and should be 
withdrawn. 

The utilities of the claimed oligonucleotides and polynucleotides are further 
discussed hereinbelow where it is shown that the utilities are substantial and credible. 



2. THE REJECTED CLAIMS HAVE SUBSTANTIAL AND CREDIBLE UTILITY 

Appellant submits that the specification provides numerous substantial and 
credible utilities for polynucleotides or oligonucleotides comprising SEQ ID NOS:9-18. 

The Examiner contends in the Advisory Action dated January 30, 2004 that 
the present invention lacks substantial utility because further research is needed to 
reasonably identify a utility. The Examiner further contends that the present invention lacks 
substantial utility because "the fact that medical practitioner can use the claimed invention 
for 'studying the biology of cellular differentiation and development'". 

Substantial utility is defined as: 

"a utility that defines a "real world" use. Utilities that require or 
constitute carrying out further research to identify or reasonably 
confirm a "real world" context of use are not substantial utilities." 
(MPEP Section 2107.01) 

An invention may nevertheless have substantial utility even though further research 
needs to be performed. For example, an assay method for identifying compounds that 
themselves have a "substantial utility" defines a "real world" context of use. Also, an assay 
that measures the presence of a material which has a stated correlation to a predisposition to 
the onset of a particular disease condition would also define a "real world" context of use in 
identifying potential candidates for preventive measures or further monitoring (MPEP 
Section 2107.01). Thus, substantial utility is not precluded by further experimentation. 

Appellant submitted in the amendment filed April 7, 2003 that, among other 
uses, the polynucleotides of the present invention may be used in the context of a 
hybridization assay, e.g., in the format of a microarray. Instead of using the entire universe 
of genes in the genome in such an experiment, the skilled person has the option of limiting 
the experiment to using polynucleotides of the invention in the microarray. In effect, genes 
that are critically essential to the survival and early growth of teratocarcinoma cells would 
be excluded from the microarray. Thus, further research is performed not to identify or 
reasonably confirm the asserted utility but to state a correlation between a gene and a 
particular stage in cellular differentiation and development. Such experimentation therefore 
does not preclude a finding of substantial utility. 

The Examiner contends in the Advisory Action dated January 30, 2004 that, 
according to MPEP 2107.01(a), basic research such as studying the properties of the 
claimed product itself or the mechanisms in which the material is involved lacks substantial 
utility. Appellant argued in the response filed April 7, 2003 that the use of polynucleotides 
of the present invention helps cut down the total number of genes that need to be studied and 
simplifies the work of a biologist who uses the presently claimed invention to study 
embryonic cell differentiation and development, and that such experimentation does not 
preclude the finding of substantial utility. The biologist is not limited to using the presently 
claimed invention to study the properties of the claimed product itself or the mechanisms in 
which the material is involved. 

Appellant submits that the guidelines caution against interpreting "immediate 
benefit to the public" to mean that products or services based on the claimed invention must 
be "currently available" to the public in order to satisfy the utility requirement. Brenner v. 
Manson, 383 U.S. 519, 534-35, 148 USPQ 689, 695 (1966). Rather, any reasonable use 
that an applicant has identified for the invention that can be viewed as providing a public 
benefit should be accepted as sufficient, at least with regard to defining a "substantial" 
utility. Here, the set of genes that are enriched for their lack of involvement in cell viability 
and their likelihood of participating in the later stages of cellular differentiation and 
development represents substantial utility to biologists who are studying late stages of 
cellular differentiation and development. The preselected set of genes is currently 
available and will immediately provide, at a minimum, the economic benefit of not having to 
put every gene in the genome on microarray(s). Accordingly, the present invention has 
utility that provides immediate benefit to biologists, which exceeds the requirement for 
finding substantial utility. 

Some confusion can result when one attempts to label certain types of 
inventions as not being capable of having a specific and substantial utility based on the 
setting in which the invention is to be used. Many research tools such as screening assays 
and nucleotide sequencing techniques have a clear, specific and unquestionable utility (e.g., 
they are useful in analyzing compounds) (MPEP Section 2107.01). Although the present 
invention may be used in a research setting by a biologist, such use of the presently claimed 
invention for further research should not render the present invention lacking substantial 
utility. 

Appellant submits that the above described utilities are well known in the art, 
and hence utilities of the present invention are credible. As stated in the Examination 
Guidelines for the Utility Requirement, credibility is assessed from the perspective of one of 
ordinary skill in the art in view of the disclosure or any other evidence of record (66 FR 
1098, Jan 5, 2001). Accordingly, not only do the oligonucleotides and polynucleotides of 
the present invention have specific utilities, their utilities are credible and practical. 

In view of the foregoing, Appellant submits that the claimed invention has 
specific, substantial and credible utility. 

B. THE REJECTED CLAIMS HAVE UTILITY 
UNDER 35 U.S.C. § 112, FIRST PARAGRAPH 

Claims 1, 3, 4, 10, and 12 are rejected under 35 U.S.C. § 112, first paragraph, 
as allegedly lacking utility. 

The Federal Circuit and its predecessor have determined that the utility 
requirement of Section 101 and the how to use requirement of Section 112, first paragraph, 
have the same basis - the disclosure of a credible utility. See In re Brana, 51 F.3d 1560, 
1564, 34 USPQ2d 1436, 1441 (Fed. Cir. 1995); see also In re Jolles, 628 F.2d 1322, 1326 n. 
11, 206 USPQ 885, 889 n. 11 (CCPA 1980); and In re Fouche, 439 F.2d 1237, 1243, 169 
USPQ 429, 434 (CCPA 1971). 

Appellant traverses this rejection on the ground that Claims 1, 3, 4, 10, and 
12 have significant patentable utility as discussed in Section A, above. Appellant submits 
that when an Appellant satisfactorily rebuts a rejection based on a lack of utility under 35 
U.S.C. § 101, the corresponding rejection imposed under 35 U.S.C. § 112, first paragraph, 
should also be withdrawn. 

C. THE REJECTED CLAIMS AND THE SPECIFICATION 
MEET THE WRITTEN DESCRIPTION REQUIREMENT 

Claims 1, 3, 4, 10, and 12 are rejected under 35 U.S.C. § 112, first paragraph, 
as allegedly containing subject matter that was not described in the specification in such a 
way as to reasonably convey to one skilled in the relevant art that the inventors, at the time 
the application was filed, had possession of the claimed invention. 

The Examiner alleges in the Advisory Action dated January 30, 2004 and in 
the Office Action dated July 2, 2003 that the disclosed subgenus and species embraced by 
the claims are not representative of the entire genus being claimed and that the specification 
does not disclose full-length cDNA or open reading frames (ORFs). The Examiner further 
alleges that disclosure of a partial sequence of otherwise uncharacterized nucleic acid 
molecules that may encode corresponding protein is insufficient to establish possession of a 
broad genus solely on the description of the partial sequence, where the broad genus 
embraces the uncharacterized nucleic acid molecules by default. This requirement is 
contrary to the requirement under case law that "Although [the applicant] does not have to 
describe exactly the subject matter claimed ... the description must clearly allow persons of 
ordinary skill in the art to recognize that [he or she] invented what is claimed." Vas-Cath v. 
Mahurkar, 19 USPQ2d 1111, 1117 (Fed. Cir. 1991). An adequate description of a chemical 
genus requires a precise definition by structure, formula, chemical name or physical 
properties sufficient to distinguish the genus from other materials. Fiers v. Revel, 25 
USPQ2d 1601, 1606 (Fed. Cir. 1993). The standard for claims involving chemical materials 
has been explicitly stated by the Federal Circuit: 

In claims involving chemical materials, generic formulae 
usually indicate with specificity what the generic claims 
encompass. One skilled in the art can distinguish such a 
formula from others and can identify many of the species that 
the claims encompass. Accordingly, such a formula is 
normally an adequate description of the claimed genus. Univ. 
of California v. Eli Lilly and Co., 43 USPQ2d 1398, 1406 
(Fed. Cir. 1997). 

Case law supports the fact that for chemical material, when one skilled in the 
art can distinguish a formula from others and can identify many of the species that the 
claims encompass, there is adequate description of the claimed genus. Hybritech v. 
Monoclonal Antibodies, 802 F.2d 1367, 1384, 231 USPQ 81, 94; Fonar Corp. v. General 
Electric Co., 107 F.3d 1543, 1549, 41 USPQ2d 1801, 1805. Thus, a claim describing a 
genus of nucleic acid by structure, formula, or chemical name sufficient to distinguish the 
genus from other materials meets the written description requirement of 35 U.S.C. § 112, 
first paragraph. By virtue of the sequences recited in claims 1, 3, 4, 10, and 12, the claimed 
isolated oligonucleotides and polynucleotides are fully described by structure, sufficient to 
distinguish the claimed isolated oligonucleotides and polynucleotides from other materials. 

For example, claim 1 recites an oligonucleotide that comprises a contiguous 
stretch of at least about 30 nucleotides of at least one of SEQ ID NOS:9, 10, 12, 13, 17, and 
18. Thus, one of skill in the art can readily distinguish the isolated oligonucleotide of claim 
1 from other materials by the description provided in claim 1. Whether a particular nucleic 
acid sequence comprises 30 contiguous nucleotides of at least one of SEQ ID NOS:9, 10, 12, 
13, 17, and 18 can be determined by the skilled artisan by sequence analysis. Merely 
because the sequences may contain sequences in addition to at least about 30 contiguous 
nucleotides of at least one of SEQ ID NOS:9, 10, 12, 13, 17, and 18, the claim should not be 
rejected for lack of written description. Here, the new aspect of the claimed isolated 
polynucleotides is the stretch of at least about 30 contiguous nucleotides of at least one of 
SEQ ID NOS:9, 10, 12, 13, 17, and 18, which is unambiguously described in the application 
by virtue of the sequence listing. Claim 3 recites polynucleotides that comprise a contiguous 
stretch of at least about 60 nucleotides of at least one of SEQ ID NOS:9, 10, 12, 13, 14, and 
16-18. As the exact structures of SEQ ID NOS:9, 10, 12, 13, 14, and 16-18 are provided in the 
specification, a person of skill in the art can readily recognize a polynucleotide as described 
in claim 3, even though numerous polynucleotides fall within this description. 
Appellant submits that the written description requirement is met for the claimed genus of 
molecules in claims 1, 3, 4, 10, and 12. 
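
By way of illustration only, the sequence analysis referred to above can be sketched in a 
few lines of Python. The sketch is not part of the record: the function name 
shares_contiguous_stretch, the fixed threshold of exactly 30 nucleotides (the claims recite 
"at least about 30"), and the two test sequences are assumptions made for the example. The 
reference fragment is the first 60 nucleotides of SEQ ID NO:12 as reproduced in Exhibit B, 
and only exact, same-strand matches are considered. 

```python
def shares_contiguous_stretch(candidate: str, reference: str, min_len: int = 30) -> bool:
    """Return True if `candidate` contains at least `min_len` contiguous
    nucleotides of `reference` (exact match, same strand only).

    Hypothetical helper for illustration; ambiguous bases such as 'n' and
    the reverse complement are deliberately not handled.
    """
    if min_len <= 0:
        raise ValueError("min_len must be positive")
    candidate = candidate.lower()
    reference = reference.lower()
    # A shared stretch longer than min_len always contains a shared stretch
    # of exactly min_len, so checking every window of that length suffices.
    for start in range(len(reference) - min_len + 1):
        if reference[start:start + min_len] in candidate:
            return True
    return False


# First 60 nucleotides of SEQ ID NO:12, as shown in Exhibit B (Query 1-60).
SEQ12_FRAGMENT = "acaggatgcctgtaatcattattcagtgagcagcaacctgcagcagctcctcctgactgg"

# Assumed test molecules: a stretch of the fragment flanked by extra sequence.
flanked = "gggg" + SEQ12_FRAGMENT[10:45] + "tttt"    # 35 contiguous nucleotides
too_short = "gggg" + SEQ12_FRAGMENT[10:35] + "tttt"  # 25 contiguous nucleotides

print(shares_contiguous_stretch(flanked, SEQ12_FRAGMENT))
print(shares_contiguous_stretch(too_short, SEQ12_FRAGMENT))
```

The additional flanking sequence in the first test molecule does not affect the result, 
which mirrors the point made above that sequence in addition to the recited stretch does not 
prevent the skilled artisan from recognizing the claimed molecule. 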

In view of the foregoing, Appellant respectfully requests that the rejection of 
Claims 1, 3, 4, and 10 under 35 U.S.C. § 112, first paragraph, be withdrawn. 

IX. CONCLUSION 

For the reasons set forth above, Appellant respectfully requests that the 
rejection of the claims on appeal under 35 U.S.C. §§ 101 and 112 be reversed. 



Respectfully submitted, 

Date: July 28, 2004 

Laura A. Coruzzi (Reg. No. 30,742) 
Jones Day 
222 East 41st Street 
New York, New York 10017-6702 
(212) 326-3939 

Enclosures 

EXHIBIT A: APPENDIX TO APPELLANTS' BRIEF ON APPEAL 



CLAIMS ON APPEAL 
Serial No. 09/398,253 
Attorney Docket No. 8535-026 

1. (Three Times Amended) An oligonucleotide comprising a contiguous stretch of 
at least about 30 nucleotides of at least one of SEQ ID NOS:9, 10, 12, 13, 17, and 18. 

3. (Three Times Amended) An isolated polynucleotide comprising a contiguous 
stretch of at least about 60 nucleotides of at least one of SEQ ID NOS:9, 10, 12, 13, 14, and 
16-18. 

4. (Amended) The isolated polynucleotide according to Claim 3, wherein said 
polynucleotide sequence comprising at least one of SEQ ID NOS:9-18. 

10. (Twice Amended) An oligonucleotide comprising a contiguous stretch of at 
least about 20 nucleotides of SEQ ID NO: 16. 

12. (New) An isolated polynucleotide consisting essentially of a contiguous stretch 
of at least about 125 nucleotides of SEQ ID NO:11 or 15. 






Query= SEQ ID NO:9
         (171 letters)

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

AC012640 ACCESSION:AC012640 NID: gi 27356677 gb AC012640.12 Ho...   149   2e-33
AC034241 ACCESSION:AC034241 NID: gi 17975241 gb AC034241.4 Hom...   149   2e-33

>AC012640 ACCESSION:AC012640 NID: gi 27356677 gb AC012640.12 Homo
          sapiens chromosome 5 clone CTD-2256P15, complete sequence
          Length = 145122

 Score = 149 bits (75), Expect = 2e-33
 Identities = 124/137 (90%), Gaps = 2/137 (1%)
 Strand = Plus / Plus

Query: 20    tgtgaggacacagcnagaagcaagtctntgcatgncnagaagaacggcctcaacagacac 79
             |||||||||||||| |||||||||| | |||| | | ||||||| |||||||||||||||
Sbjct: 54846 tgtgaggacacagcgagaagcaagtatctgcaagtcaagaagaaaggcctcaacagacac 54905

Query: 80    canncctgccagcaccttgatcttgg-cttntggcctccagaactgtgaaagantaaaga 138
             ||  |||||||||||||||||||||| ||| |||||||||||||||||||||| |||| |
Sbjct: 54906 cagccctgccagcaccttgatcttggacttctggcctccagaactgtgaaagaataaa-a 54964

Query: 139   ttctgttgtttaagcca 155
             |||||||||||||||||
Sbjct: 54965 ttctgttgtttaagcca 54981

>AC034241 ACCESSION:AC034241 NID: gi 17975241 gb AC034241.4 Homo
          sapiens chromosome 5 clone CTD-2360O20, complete sequence
          Length = 77702

 Score = 149 bits (75), Expect = 2e-33
 Identities = 124/137 (90%), Gaps = 2/137 (1%)
 Strand = Plus / Plus

Query: 20    tgtgaggacacagcnagaagcaagtctntgcatgncnagaagaacggcctcaacagacac 79
             |||||||||||||| |||||||||| | |||| | | ||||||| |||||||||||||||
Sbjct: 10337 tgtgaggacacagcgagaagcaagtatctgcaagtcaagaagaaaggcctcaacagacac 10396

Query: 80    canncctgccagcaccttgatcttgg-cttntggcctccagaactgtgaaagantaaaga 138
             ||  |||||||||||||||||||||| ||| |||||||||||||||||||||| |||| |
Sbjct: 10397 cagccctgccagcaccttgatcttggacttctggcctccagaactgtgaaagaataaa-a 10455

Query: 139   ttctgttgtttaagcca 155
             |||||||||||||||||
Sbjct: 10456 ttctgttgtttaagcca 10472



Query= SEQ ID NO:10
         (294 letters)

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

AC140062 ACCESSION:AC140062 NID: gi 29150317 gb AC140062.11 Ho...   351   5e-94

>AC140062 ACCESSION:AC140062 NID: gi 29150317 gb AC140062.11 Homo
          sapiens 12 BAC RP13-298C8 (Roswell Park Cancer Institute
          Human BAC Library) complete sequence
          Length = 64695

 Score = 351 bits (177), Expect = 5e-94
 Identities = 180/181 (99%)
 Strand = Plus / Plus

Query: 114  aggcactgggtaggaacacagccaagaacgattgcaggatgggtccttccaggacactga 173
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1688 aggcactgggtaggaacacagccaagaacgattgcaggatgggtccttccaggacactga 1747

Query: 174  cgtctcagcttgcgcactgtgagtccctggacgagttactccacctctctgaacctcctc 233
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1748 cgtctcagcttgcgcactgtgagtccctggacgagttactccacctctctgaacctcctc 1807

Query: 234  ctcacttgcataatgggaaaaataatggacatagggagatgaaacaagaccttggagacc 293
            ||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||
Sbjct: 1808 ctcacttgcataatgggaaaaataatggacataggaagatgaaacaagaccttggagacc 1867

Query: 294  a 294
            |
Sbjct: 1868 a 1868



Query= SEQ ID NO:11
         (241 letters)

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

AC112518.1.1.78409                                                  426   e-117

>AC112518.1.1.78409
          Length = 78409

 Score = 426 bits (215), Expect = e-117
 Identities = 232/239 (97%)
 Strand = Plus / Minus

Query: 3    atgccttctaaacagcctaccctgcccagngccatgattactgtgaccacatcttcagag 62
            |||||||||||||||||||||||||| || ||||||||||||||||||||||||||||| 
Sbjct: 2737 atgccttctaaacagcctaccctgccaagtgccatgattactgtgaccacatcttcagaa 2678

Query: 63   ccagaaaacaggatacctggccctaagcatgcactcatggagcanaagagttttaaatct 122
            |||||||||||||||||||||||||||||||||||||||||||| |||||||||||||||
Sbjct: 2677 ccagaaaacaggatacctggccctaagcatgcactcatggagcagaagagttttaaatct 2618

Query: 123  gntatgccacagaagacagaagataacatgcttactacacttgtnaagcaacatgcagcc 182
            |  ||||||||||||||||||||||||||||||||||||||||| |||||||||||||||
Sbjct: 2617 ggaatgccacagaagacagaagataacatgcttactacacttgtaaagcaacatgcagcc 2558

Query: 183  agccatttccagtgcaaattatctcattgcatagtgtgacaactaaaggtcataaccat 241
            |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 2557 agccatttccagtgcaaattatctcattgcatagtgtgacaactaaaggtcataaccat 2499



Query= SEQ ID NO:12
         (197 letters)

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

AL158207 ACCESSION:AL158207 NID: gi 12717949 emb AL158207.15 H...   391   e-106

>AL158207 ACCESSION:AL158207 NID: gi 12717949 emb AL158207.15 Human DNA
          sequence from clone RP11-409K20 on chromosome 9 Contains
          the TOR1B gene for torsin family 1 member B (torsin B)
          (DQ1), the DYT1 gene for "dystonia 1, torsion" (autosomal
          dominant; torsin A) (DQ2, TOR1A), the gene for
          hepatocellular carcinoma-associated antigen 59 (HSPC220,
          LOC51759), the USP20 gene for ubiquitin specific protease
          20 (KIAA1003), and the gene for formin-binding protein 17
          (FBP17, includes KIAA0554, FLJ13619, FLJ10754 and
          FLJ10113). Contains ESTs, STSs, GSSs and four CpG islands,
          c
          Length = 169963

 Score = 391 bits (197), Expect = e-106
 Identities = 197/197 (100%)
 Strand = Plus / Plus

Query: 1      acaggatgcctgtaatcattattcagtgagcagcaacctgcagcagctcctcctgactgg 60
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 137396 acaggatgcctgtaatcattattcagtgagcagcaacctgcagcagctcctcctgactgg 137455

Query: 61     cagatgggcctggcggccacccagaggctggggacacagcaagaatccagcacagcaccg 120
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 137456 cagatgggcctggcggccacccagaggctggggacacagcaagaatccagcacagcaccg 137515

Query: 121    atcccgattccctcctccccaaactacctgagccatggacctcattttgtggacaaaatt 180
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 137516 atcccgattccctcctccccaaactacctgagccatggacctcattttgtggacaaaatt 137575

Query: 181    aaacttgccactttcac 197
              |||||||||||||||||
Sbjct: 137576 aaacttgccactttcac 137592



Query= SEQ ID NO:13
         (387 letters)

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

AL161936.15.1.155584                                                753   0.0

>AL161936.15.1.155584
          Length = 155584

 Score = 753 bits (380), Expect = 0.0
 Identities = 385/387 (99%)
 Strand = Plus / Plus

Query: 1      tggtgcttactaaaaattgaataancgtggaaaagagaaaatctccctctttaaaaggaa 60
              |||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||
Sbjct: 144402 tggtgcttactaaaaattgaataaacgtggaaaagagaaaatctccctctttaaaaggaa 144461

Query: 61     cactgttgtggacattttaaaatgcaaacgccttggctggaagtcagaaatcgtgttctc 120
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 144462 cactgttgtggacattttaaaatgcaaacgccttggctggaagtcagaaatcgtgttctc 144521

Query: 121    tctgctaaacctggtgtagcatttaacacgcttgaagtggaggcatctggtcaccaattt 180
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 144522 tctgctaaacctggtgtagcatttaacacgcttgaagtggaggcatctggtcaccaattt 144581

Query: 181    cacagcctggacagagcaagaaggtgcggctggcttaggaggcggcctgccgggggggat 240
              ||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 144582 cacagcctggacagagcaagaaggtgcggctggtttaggaggcggcctgccgggggggat 144641

Query: 241    cgtctgtccatctgggcttggtaaatgtcaagggtcatttccctgtcctgacatttgatt 300
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 144642 cgtctgtccatctgggcttggtaaatgtcaagggtcatttccctgtcctgacatttgatt 144701

Query: 301    gtgaagcaggttgcgaggtaactctttcaagggactggactgtgacagtcaccatagttg 360
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 144702 gtgaagcaggttgcgaggtaactctttcaagggactggactgtgacagtcaccatagttg 144761

Query: 361    gacaataaaacccgaacatccttcacc 387
              |||||||||||||||||||||||||||
Sbjct: 144762 gacaataaaacccgaacatccttcacc 144788



Query= SEQ ID NO:14
         (326 letters)

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

AC092768 ACCESSION:AC092768 NID: gi 18182777 gb AC092768.6 Hom...   466   e-128

>AC092768 ACCESSION:AC092768 NID: gi 18182777 gb AC092768.6 Homo
          sapiens chromosome 11, clone RP11-1149L18, complete
          sequence
          Length = 146364

 Score = 466 bits (235), Expect = e-128
 Identities = 301/327 (92%), Gaps = 1/327 (0%)
 Strand = Plus / Minus

Query: 1    ggacagtggctaactcagcagacnaaccacagcttcctgccctttgcagatggcntgaan 60
            ||||||||||||||||||||||| ||||| |||||||||||||||||||||||| |||| 
Sbjct: 8644 ggacagtggctaactcagcagacgaaccagagcttcctgccctttgcagatggcatgaag 8585

Query: 61   ataagagtttgccaaacaactaagatgggctcttgattgagcaaanaaaccacaacatgg 120
            ||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||
Sbjct: 8584 ataagagtttgccaaacaactaagatgggctcttgattgagcaaagaaaccacaacatgg 8525

Query: 121  gacacacagagccaccctattgncctactgtcattcaagcttaaaggagacatatctaca 180
            |||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||
Sbjct: 8524 gacacacagagccaccctattgccctactgtcattcaagcttaaaggagacatatctaca 8465

Query: 181  gacagggtttgagcctagtnatggnganaactttcttggatgtctcaacancctgganat 240
            ||||||||||||||||||| |||| || |||||||||||||||||||||| |||||| ||
Sbjct: 8464 gacagggtttgagcctagtaatggtgagaactttcttggatgtctcaacagcctggagat 8405

Query: 241  gannntcccnacaaggcagaanancnaggtggnacattgntnntattgctttttatt-ca 299
            ||   |||| | ||||||||| |   |||||| |||||| |  ||||| |||||||| ||
Sbjct: 8404 gaaattcccaagaaggcagaaaatagaggtggcacattggttttattgttttttattaca 8345

Query: 300  attataaaagtaatgcatgctttttgt 326
            |||||||||||||||||||||||||||
Sbjct: 8344 attataaaagtaatgcatgctttttgt 8318



Query= SEQ ID NO:15
         (166 letters)

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

AC008115.3.1.158431                                                 321   7e-86

>AC008115.3.1.158431
          Length = 158431

 Score = 321 bits (162), Expect = 7e-86
 Identities = 165/166 (99%)
 Strand = Plus / Minus

Query: 1     tcagtatcctgacctggcaaggtgttccttaacctcccctctggatcccccttagcacac 60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 43020 tcagtatcctgacctggcaaggtgttccttaacctcccctctggatcccccttagcacac 42961

Query: 61    atctgggacaatggagcgttcagcaccacggacagcattacaccctcttcaagtgcttgt 120
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 42960 atctgggacaatggagcgttcagcaccacggacagcattacaccctcttcaagtgcttgt 42901

Query: 121   taaggccatttgtctatttcactctcaagtaaataaaaatattttt 166
             ||| ||||||||||||||||||||||||||||||||||||||||||
Sbjct: 42900 taaagccatttgtctatttcactctcaagtaaataaaaatattttt 42855



Query= SEQ ID NO:16
         (638 letters)

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

AL021391 ACCESSION:AL021391 NID: gi 4467344 emb AL021391.2 HS10...  347   2e-92

>AL021391 ACCESSION:AL021391 NID: gi 4467344 emb AL021391.2 HS102D24
          Human DNA sequence from clone RP1-102D24 on chromosome 22
          Contains a novel Mitosis-specific Chromosome Segregation
          protein SMC1 LIKE protein gene, a novel unknown gene, and
          the first coding exon of the FBLN1 gene for Fibulin 1.
          Contains ESTs, STSs, GSSs and putative CpG islands,
          complete sequence
          Length = 138129

 Score = 347 bits (175), Expect = 2e-92
 Identities = 175/175 (100%)
 Strand = Plus / Minus

Query: 395   aggaggtggacagtgaacacagaaaagctgtaaggtgtcctgtgacagatgtatgtggtg 454
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 90259 aggaggtggacagtgaacacagaaaagctgtaaggtgtcctgtgacagatgtatgtggtg 90200

Query: 455   gacacagcaggacccagaggaaggaagaaagaagctgctcttgaaaagaccctcaaacca 514
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 90199 gacacagcaggacccagaggaaggaagaaagaagctgctcttgaaaagaccctcaaacca 90140

Query: 515   cgatgctcaaggaagtgtcgagagatgaaggagaggtgtttgccaggcagagcag 569
             |||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 90139 cgatgctcaaggaagtgtcgagagatgaaggagaggtgtttgccaggcagagcag 90085

 Score = 248 bits (125), Expect = 1e-62
 Identities = 127/128 (99%)
 Strand = Plus / Minus

Query: 270   ggcctctgcgagactgtttcatagatgctcaagacaccagcaaaccagngccaccgaaca 329
             |||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||
Sbjct: 90631 ggcctctgcgagactgtttcatagatgctcaagacaccagcaaaccagtgccaccgaaca 90572

Query: 330   agtatgagaaaagaacaggctagattatgttatccagaacttcacaaccatcagatctag 389
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 90571 agtatgagaaaagaacaggctagattatgttatccagaacttcacaaccatcagatctag 90512

Query: 390   acagaagg 397
             ||||||||
Sbjct: 90511 acagaagg 90504

 Score = 111 bits (56), Expect = 2e-21
 Identities = 64/67 (95%)
 Strand = Plus / Minus

Query: 568   agtagagacaagttttcgccatgttggtcaagctggtctcaaacttctaacctnacgtaa 627
             |||||||||||||||||||||||||||||| |||||||||||||| ||||||| ||||||
Sbjct: 89080 agtagagacaagttttcgccatgttggtcaggctggtctcaaactcctaacctcacgtaa 89021

Query: 628   tccaccc 634
             |||||||
Sbjct: 89020 tccaccc 89014

 Score = 75.8 bits (38), Expect = 1e-10
 Identities = 46/50 (92%)
 Strand = Plus / Minus

Query: 219   ccaggttnnagtgattcccgtgcttcngnctcctgagaagctgggattac 268
             |||||||  ||||||||||||||||| | |||||||||||||||||||||
Sbjct: 94134 ccaggttcaagtgattcccgtgcttcagcctcctgagaagctgggattac 94085



Query= SEQ ID NO:17
         (403 letters)

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

AC015933.9.1.249021                                                 668   0.0

>AC015933.9.1.249021
          Length = 249021

 Score = 668 bits (337), Expect = 0.0
 Identities = 383/402 (95%), Gaps = 3/402 (0%)
 Strand = Plus / Plus

Query: 3      aaagagaaaaacaacattcaacancaacancaatttcccgaggatccctgcccacattca 62
              ||||||||||||||||  ||||| ||||| ||||||||||||||||||||||||||||||
Sbjct: 224797 aaagagaaaaacaacaa-caacaacaacaacaatttcccgaggatccctgcccacattca 224855

Query: 63     nagt-gncacatttacctacttnanaggggagatnaaagccncactctaaggctccttat 121
               ||| | || |||||||||||| | || |||||| |||||| ||||||||||||||||||
Sbjct: 224856 gagtag-cagatttacctacttcaaagtggagatcaaagccacactctaaggctccttat 224914

Query: 122    ttccacaggctggnaagcaaacanggcntacaggctttgcangagtgtatcctaattctc 181
              ||||||||||||| ||||||||| ||| ||||||||||||| ||||||||||||||||||
Sbjct: 224915 ttccacaggctggcaagcaaacaaggcatacaggctttgcaagagtgtatcctaattctc 224974

Query: 182    ttactgaagaaaagtcaacagcagagacancacagaaaaaggaatcaaagaggccaaatc 241
              ||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||
Sbjct: 224975 ttactgaagaaaagtcaacagcagagacaacacagaaaaaggaatcaaagaggccaaatc 225034

Query: 242    tgnggactcaaaacaataagaaaaaataaatcaactttgctaaaatttaagaatgccagg 301
              || |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 225035 tgtggactcaaaacaataagaaaaaataaatcaactttgctaaaatttaagaatgccagg 225094

Query: 302    ggggtaggtaaatgcactgggaagtatgtgtggactatgatgataataaatctcctttca 361
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 225095 ggggtaggtaaatgcactgggaagtatgtgtggactatgatgataataaatctcctttca 225154

Query: 362    atacaactgatatttatcagaccttgaataaaacactgaatg 403
              ||||||||||||||||||||||||||||||||||||||||||
Sbjct: 225155 atacaactgatatttatcagaccttgaataaaacactgaatg 225196






Query= SEQ ID NO: 18
(103 letters)

                                              Score    E
Sequences producing significant alignments:   (bits)   Value

AL360270 ACCESSION:AL360270 NID: gi 11121069 emb AL360270.18 H...   198   1e-48

>AL360270 ACCESSION:AL360270 NID: gi 11121069 emb AL360270.18 Human DNA
sequence from clone RP11-96K19 on chromosome 1, complete
sequence
Length = 172805

Score = 198 bits (100), Expect = 1e-48
Identities = 102/103 (99%)
Strand = Plus / Plus

Query: 1     actttctccaagctactcagaagactgaagcagaaggatcacttgaggccaggagttcaa 60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 93618 actttctccaagctactcagaagactgaagcagaaggatcacttgaggccaggagttcaa 93677

Query: 61    gatcagcctgagcaacatagngaaaccctatctctaaaaatac 103
             |||||||||||||||||||| ||||||||||||||||||||||
Sbjct: 93678 gatcagcctgagcaacatagtgaaaccctatctctaaaaatac 93720
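
The Identities and Gaps figures in these reports can be recomputed from any aligned Query/Sbjct pair of rows. The following is a minimal sketch, not the BLAST implementation: the function names are hypothetical, an ambiguous base n is treated as a mismatch against a, c, g, or t, and the percentage is truncated rather than rounded, which is an assumption inferred from the report line "Identities = 64/67 (95%)".

```python
def alignment_stats(query: str, sbjct: str) -> dict:
    """Count aligned columns, identities, and gap columns for one alignment row."""
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have the same number of columns")
    identities = 0
    gaps = 0
    for q, s in zip(query, sbjct):
        if q == "-" or s == "-":
            gaps += 1          # a gap character in either row
        elif q == s:
            identities += 1    # identical symbols in both rows
    return {"columns": len(query), "identities": identities, "gaps": gaps}


def match_line(query: str, sbjct: str) -> str:
    """Rebuild the middle line: '|' under identical bases, ' ' elsewhere."""
    return "".join(
        "|" if q == s and q != "-" else " " for q, s in zip(query, sbjct)
    )


def percent_identity(identities: int, columns: int) -> int:
    """Whole-number percentage, truncated (assumed to match the reports)."""
    return 100 * identities // columns


# Row of the SEQ ID NO: 17 alignment covering Query positions 63 to 121.
query_row = "nagt-gncacatttacctacttnanaggggagatnaaagccncactctaaggctccttat"
sbjct_row = "gagtag-cagatttacctacttcaaagtggagatcaaagccacactctaaggctccttat"

stats = alignment_stats(query_row, sbjct_row)
print(stats)
print(match_line(query_row, sbjct_row))
print(percent_identity(383, 402))
```

For this one row the sketch counts 60 columns, 51 identities, and 2 gap columns; summing such counts over all rows of an alignment gives the totals printed in its header.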






The Sequence of the Human Genome 

J. Craig Venter, 1* Mark D. Adams, 1 Eugene W. Myers, 1 Peter W. Li, 1 Richard J. Mural, 1
Granger G. Sutton, 1 Hamilton O. Smith, 1 Mark Yandell, 1 Cheryl A. Evans, 1 Robert A. Holt, 1
Jeannine D. Gocayne, 1 Peter Amanatides, 1 Richard M. Ballew, 1 Daniel H. Huson, 1
Jennifer Russo Wortman, 1 Qing Zhang, 1 Chinnappa D. Kodira, 1 Xiangqun H. Zheng, 1 Lin Chen, 1
Marian Skupski, 1 Gangadharan Subramanian, 1 Paul D. Thomas, 1 Jinghui Zhang, 1
George L. Gabor Miklos, 2 Catherine Nelson, 3 Samuel Broder, 1 Andrew G. Clark, 4 Joe Nadeau, 5
Victor A. McKusick, 6 Norton Zinder, 7 Arnold J. Levine, 7 Richard J. Roberts, 8 Mel Simon, 9
Carolyn Slayman, 10 Michael Hunkapiller, 11 Randall Bolanos, 1 Arthur Delcher, 1 Ian Dew, 1 Daniel Fasulo, 1
Michael Flanigan, 1 Liliana Florea, 1 Aaron Halpern, 1 Sridhar Hannenhalli, 1 Saul Kravitz, 1 Samuel Levy, 1
Clark Mobarry, 1 Knut Reinert, 1 Karin Remington, 1 Jane Abu-Threideh, 1 Ellen Beasley, 1 Kendra Biddick, 1
Vivien Bonazzi, 1 Rhonda Brandon, 1 Michele Cargill, 1 Ishwar Chandramouliswaran, 1 Rosane Charlab, 1
Kabir Chaturvedi, 1 Zuoming Deng, 1 Valentina Di Francesco, 1 Patrick Dunn, 1 Karen Eilbeck, 1
Carlos Evangelista, 1 Andrei E. Gabrielian, 1 Weiniu Gan, 1 Wangmao Ge, 1 Fangcheng Gong, 1 Zhiping Gu, 1
Ping Guan, 1 Thomas J. Heiman, 1 Maureen E. Higgins, 1 Rui-Ru Ji, 1 Zhaoxi Ke, 1 Karen A. Ketchum, 1
Zhongwu Lai, 1 Yiding Lei, 1 Zhenya Li, 1 Jiayin Li, 1 Yong Liang, 1 Xiaoying Lin, 1 Fu Lu, 1
Gennady V. Merkulov, 1 Natalia Milshina, 1 Helen M. Moore, 1 Ashwinikumar K. Naik, 1
Vaibhav A. Narayan, 1 Beena Neelam, 1 Deborah Nusskern, 1 Douglas B. Rusch, 1 Steven Salzberg, 12
Wei Shao, 1 Bixiong Shue, 1 Jingtao Sun, 1 Zhen Yuan Wang, 1 Aihui Wang, 1 Xin Wang, 1 Jian Wang, 1
Ming-Hui Wei, 1 Ron Wides, 13 Chunlin Xiao, 1 Chunhua Yan, 1 Alison Yao, 1 Jane Ye, 1 Ming Zhan, 1
Weiqing Zhang, 1 Hongyu Zhang, 1 Qi Zhao, 1 Liansheng Zheng, 1 Fei Zhong, 1 Wenyan Zhong, 1
Shiaoping C. Zhu, 1 Shaying Zhao, 1 . 2 Dennis Gilbert, 1 Suzanna Baumhueter, 1 Gene Spier, 1
Christine Carter, 1 Anibal Cravchik, 1 Trevor Woodage, 1 Feroze Ali, 1 Huijin An, 1 Aderonke Awe, 1
Danita Baldwin, 1 Holly Baden, 1 Mary Barnstead, 1 Ian Barrow, 1 Karen Beeson, 1 Dana Busam, 1
Amy Carver, 1 Angela Center, 1 Ming Lai Cheng, 1 Liz Curry, 1 Steve Danaher, 1 Lionel Davenport, 1
Raymond Desilets, 1 Susanne Dietz, 1 Kristina Dodson, 1 Lisa Doup, 1 Steven Ferriera, 1 Neha Garg, 1
Andres Gluecksmann, 1 Brit Hart, 1 Jason Haynes, 1 Charles Haynes, 1 Cheryl Heiner, 1 Suzanne Hladun, 1
Damon Hostin, 1 Jarrett Houck, 1 Timothy Howland, 1 Chinyere Ibegwam, 1 Jeffery Johnson, 1
Francis Kalush, 1 Lesley Kline, 1 Shashi Koduru, 1 Amy Love, 1 Felecia Mann, 1 David May, 1
Steven McCawley, 1 Tina McIntosh, 1 Ivy McMullen, 1 Mee Moy, 1 Linda Moy, 1 Brian Murphy, 1
Keith Nelson, 1 Cynthia Pfannkoch, 1 Eric Pratts, 1 Vinita Puri, 1 Hina Qureshi, 1 Matthew Reardon, 1
Robert Rodriguez, 1 Yu-Hui Rogers, 1 Deanna Romblad, 1 Bob Ruhfel, 1 Richard Scott, 1 Cynthia Sitter, 1
Michelle Smallwood, 1 Erin Stewart, 1 Renee Strong, 1 Ellen Suh, 1 Reginald Thomas, 1 Ni Ni Tint, 1
Sukyee Tse, 1 Claire Vech, 1 Gary Wang, 1 Jeremy Wetter, 1 Sherita Williams, 1 Monica Williams, 1
Sandra Windsor, 1 Emily Winn-Deen, 1 Keriellen Wolfe, 1 Jayshree Zaveri, 1 Karena Zaveri, 1
Josep F. Abril, 14 Roderic Guigo, 14 Michael J. Campbell, 1 Kimmen V. Sjolander, 1 Brian Karlak, 1
Anish Kejariwal, 1 Huaiyu Mi, 1 Betty Lazareva, 1 Thomas Hatton, 1 Apurva Narechania, 1 Karen Diemer, 1
Anushya Muruganujan, 1 Nan Guo, 1 Shinji Sato, 1 Vineet Bafna, 1 Sorin Istrail, 1 Ross Lippert, 1
Russell Schwartz, 1 Brian Walenz, 1 Shibu Yooseph, 1 David Allen, 1 Anand Basu, 1 James Baxendale, 1
Louis Blick, 1 Marcelo Caminha, 1 John Carnes-Stine, 1 Parris Caulk, 1 Yen-Hui Chiang, 1 My Coyne, 1
Carl Dahlke, 1 Anne Deslattes Mays, 1 Maria Dombroski, 1 Michael Donnelly, 1 Dale Ely, 1 Shiva Esparham, 1
Carl Fosler, 1 Harold Gire, 1 Stephen Glanowski, 1 Kenneth Glasser, 1 Anna Glodek, 1 Mark Gorokhov, 1
Ken Graham, 1 Barry Gropman, 1 Michael Harris, 1 Jeremy Heil, 1 Scott Henderson, 1 Jeffrey Hoover, 1
Donald Jennings, 1 Catherine Jordan, 1 James Jordan, 1 John Kasha, 1 Leonid Kagan, 1 Cheryl Kraft, 1
Alexander Levitsky, 1 Mark Lewis, 1 Xiangjun Liu, 1 John Lopez, 1 Daniel Ma, 1 William Majoros, 1
Joe McDaniel, 1 Sean Murphy, 1 Matthew Newman, 1 Trung Nguyen, 1 Ngoc Nguyen, 1 Marc Nodell, 1
Sue Pan, 1 Jim Peck, 1 Marshall Peterson, 1 William Rowe, 1 Robert Sanders, 1 John Scott, 1
Michael Simpson, 1 Thomas Smith, 1 Arlan Sprague, 1 Timothy Stockwell, 1 Russell Turner, 1 Eli Venter, 1
Mei Wang, 1 Meiyuan Wen, 1 David Wu, 1 Mitchell Wu, 1 Ashley Xia, 1 Ali Zandieh, 1 Xiaohong Zhu 1






16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org 



THE HUMAN GENOME 

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of
the human genome was generated by the whole-genome shotgun sequencing 
method. The 14.8-billion bp DNA sequence was generated over 9 months from 
27,271,853 high-quality sequence reads (5.11-fold coverage of the genome)
from both ends of plasmid clones made from the DNA of five individuals. Two 
assembly strategies— a whole-genome assembly and a regional chromosome 
assembly— were used, each combining sequence data from Celera and the 
publicly funded genome effort. The public data were shredded into 550-bp 
segments to create a 2.9-fold coverage of those genome regions that had been 
sequenced, without including biases inherent in the cloning and assembly
procedure used by the publicly funded group. This brought the effective cov- 
erage in the assemblies to eightfold, reducing the number and size of gaps in 
the final assembly over what would be obtained with 5.11-fold coverage. The
two assembly strategies yielded very similar results that largely agree with 
independent mapping data. The assemblies effectively cover the euchromatic 
regions of the human chromosomes. More than 90% of the genome is in 
scaffold assemblies of 100,000 bp or more, and 25% of the genome is in 
scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 
26,588 protein-encoding transcripts for which there was strong corroborating 
evidence and an additional ~12,000 computationally derived genes with mouse
matches or other weak supporting evidence. Although gene-dense clusters are 
obvious, almost half the genes are dispersed in low G+C sequence separated 
by large tracts of apparently noncoding sequence. Only 1.1% of the genome 
is spanned by exons, whereas 24% is in introns, with 75% of the genome being
intergenic DNA. Duplications of segmental blocks, ranging in size up to
chromosomal lengths, are abundant throughout the genome and reveal a complex
evolutionary history. Comparative genomic analysis indicates vertebrate ex- 
pansions of genes associated with neuronal function, with tissue-specific de- 
velopmental regulation, and with the hemostasis and immune systems. DNA 
sequence comparisons between the consensus sequence and publicly funded 
genome data provided locations of 2.1 million single-nucleotide polymorphisms 
(SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 
1250 on average, but there was marked heterogeneity in the level of poly- 
morphism across the genome. Less than 1% of all SNPs resulted in variation in 
proteins, but the task of determining which SNPs have functional consequences 
remains an open challenge. 
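
The coverage figures quoted in this abstract follow from simple arithmetic. The sketch below recomputes them, assuming the 543-bp average trimmed read length given in section 1.2 and the 2.9-Gbp genome size assumed in section 2.1. The function names are illustrative only, and the uncovered-fraction estimate uses the idealized Poisson model of uniformly random reads, which is an assumption added here and not a calculation made in the paper.

```python
import math

N_READS = 27_271_853       # high-quality sequence reads (abstract)
MEAN_READ_BP = 543         # average trimmed read length (section 1.2)
GENOME_BP = 2.9e9          # genome size assumed in section 2.1


def fold_coverage(n_reads: int, mean_read_bp: float, genome_bp: float) -> float:
    """Total sequenced bases divided by genome size."""
    return n_reads * mean_read_bp / genome_bp


def expected_uncovered_fraction(coverage: float) -> float:
    """Poisson model: probability that a given base is hit by no read."""
    return math.exp(-coverage)


total_bp = N_READS * MEAN_READ_BP
celera_cov = fold_coverage(N_READS, MEAN_READ_BP, GENOME_BP)

print(f"total sequence: {total_bp / 1e9:.1f} billion bp")
print(f"coverage: {celera_cov:.2f}-fold")
for c in (celera_cov, 8.0):
    frac = expected_uncovered_fraction(c)
    print(f"{c:.2f}-fold -> expected uncovered fraction ~ {frac:.1e}")
```

Under this idealized model, raising the effective coverage from about fivefold to eightfold lowers the expected uncovered fraction by more than an order of magnitude, which is in line with the abstract's remark about fewer and smaller gaps. Real libraries are not perfectly random, so the numbers are indicative only.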



Decoding of the DNA that constitutes the 
human genome has been widely anticipated 
for the contribution it will make toward
understanding human evolution, the causation
of disease, and the interplay between the 
environment and heredity in defining the hu- 
man condition. A project with the goal of 
determining the complete nucleotide se- 
quence of the human genome was first for- 
mally proposed in 1985 (1). In subsequent 
years, the idea met with mixed reactions in 
the scientific community (2). However, in 
1990, the Human Genome Project (HGP) was 
officially initiated in the United States under 
the direction of the National Institutes of 
Health and the U.S. Department of Energy 
with a 15-year, $3 billion plan for completing 
the genome sequence. In 1998 we announced 
our intention to build a unique genome- 
sequencing facility, to determine the se- 
quence of the human genome over a 3-year 
period. Here we report the penultimate mile- 
stone along the path toward that goal, a nearly 
complete sequence of the euchromatic por- 
tion of the human genome. The sequencing 
was performed by a whole-genome random 
shotgun method with subsequent assembly of 
the sequenced segments.

The modern history of DNA sequencing
began in 1977, when Sanger reported his method
for determining the order of nucleotides of
DNA using chain-terminating nucleotide analogs
(3). In the same year, the first human gene
was isolated and sequenced (4). In 1986, Hood 
and co-workers (5) described an improvement 
in the Sanger sequencing method that included 
attaching fluorescent dyes to the nucleotides, 
which permitted them to be sequentially read 
by a computer. The first automated DNA se- 
quencer, developed by Applied Biosystems in 
California in 1987, was shown to be successful 
when the sequences of two genes were obtained 
with this new technology (6). From early se- 
quencing of human genomic regions (7), it 
became clear that cDNA sequences (which are 
reverse-transcribed from RNA) would be es- 
sential to annotate and validate gene predictions 
in the human genome. These studies were the 
basis in part for the development of the ex- 
pressed sequence tag (EST) method of gene 
identification (8), which is a random selection, 
very high throughput sequencing approach to 
characterize cDNA libraries. The EST method 
led to the rapid discovery and mapping of hu- 
man genes (9). The increasing numbers of hu- 
man EST sequences necessitated the develop- 
ment of new computer algorithms to analyze 
large amounts of sequence data, and in 1993 at 
The Institute for Genomic Research (TIGR), an 
algorithm was developed that permitted assem- 
bly and analysis of hundreds of thousands of 
ESTs. This algorithm permitted characteriza- 
tion and annotation of human genes on the basis 
of 30,000 EST assemblies (10). 

The complete 49-kbp bacteriophage lambda
genome sequence was determined by a
shotgun restriction digest method in 1982 
(11). When considering methods for sequenc- 
ing the smallpox virus genome in 1991 (12), 
a whole-genome shotgun sequencing method 
was discussed and subsequently rejected ow- 
ing to the lack of appropriate software tools 
for genome assembly. However, in 1994, 
when a microbial genome-sequencing project 
was contemplated at TIGR, a whole-genome 
shotgun sequencing approach was considered 
possible with the TIGR EST assembly algo- 
rithm. In 1995, the 1.8-Mbp Haemophilus 
influenzae genome was completed by a 
whole-genome shotgun sequencing method 
(13). The experience with several subsequent
genome-sequencing efforts established the 
broad applicability of this approach (14, 15). 

A key feature of the sequencing approach 
used for these megabase-size and larger ge- 
nomes was the use of paired-end sequences 
(also called mate pairs), derived from sub- 
clone libraries with distinct insert sizes and 
cloning characteristics. Paired-end sequences 
are sequences 500 to 600 bp in length from 
both ends of double-stranded DNA clones of 
prescribed lengths. The success of using end 
sequences from long segments (18 to 20 kbp) 
of DNA cloned into bacteriophage lambda in 
assembly of the microbial genomes led to the 
suggestion (16) of an approach to simultaneously
map and sequence the human genome
by means of end sequences from 150-kbp
bacterial artificial chromosomes (BACs)
(17, 18). The end sequences spanned by
known distances provide long-range continu- 
ity across the genome. A modification of the 
BAC end-sequencing (BES) method was applied
successfully to complete chromosome 2
from the Arabidopsis thaliana genome (19). 
In 1997, Weber and Myers (20) proposed
whole-genome shotgun sequencing of the 
human genome. Their proposal was not well 
received (21). However, by early 1998, as 
less than 5% of the genome had been se- 
quenced, it was clear that the rate of progress 
in human genome sequencing worldwide 
was very slow (22), and the prospects for 
finishing the genome by the 2005 goal were 
uncertain. 

In early 1998, PE Biosystems (now Applied 
Biosystems) developed an automated, high- 
throughput capillary DNA sequencer, subse- 
quently called the ABI PRISM 3700 DNA 
Analyzer. Discussions between PE Biosystems 
and TIGR scientists resulted in a plan to undertake
the sequencing of the human genome with
the 3700 DNA Analyzer and the whole-genome 
shotgun sequencing techniques developed at 
TIGR (23). Many of the principles of operation
of a genome-sequencing facility were estab- 
lished in the TIGR facility (24). However, the 
facility envisioned for Celera would have a 
capacity roughly 50 times that of TIGR, and 
thus new developments were required for sample
preparation and tracking and for whole-
genome assembly. Some argued that the re- 
quired 150-fold scale-up from the H. influenzae 
genome to the human genome with its complex 
repeat sequences was not feasible (25). The 
Drosophila melanogaster genome was thus 
chosen as a test case for whole-genome assem- 
bly on a large and complex eukaryotic genome. 
In collaboration with Gerald Rubin and the 
Berkeley Drosophila Genome Project, the nu- 
cleotide sequence of the 120-Mbp euchromatic 
portion of the Drosophila genome was deter- 
mined over a 1-year period (26-28). The Dro- 
sophila genome-sequencing effort resulted in 
two key findings: (i) that the assembly algo- 
rithms could generate chromosome assemblies 
with highly accurate order and orientation, with 
substantially less than 10-fold coverage, and (ii) 
that undertaking multiple interim assemblies in 
place of one comprehensive final assembly was 
not of value. 

These findings, together with the dramatic 
changes in the public genome effort subsequent 
to the formation of Celera (29), led to a modi- 
fied whole-genome shotgun sequencing ap- 
proach to the human genome. We initially pro- 
posed to do 10-fold sequence coverage of the 
genome over a 3-year period and to make interim
assembled sequence data available quarterly.
The modifications included a plan to perform
random shotgun sequencing to ~5-fold
coverage and to use the unordered and unoriented
BAC sequence fragments and subassemblies
published in GenBank by the publicly
funded genome effort (30) to accelerate the
project. We also abandoned the quarterly announcements
in the absence of interim assemblies
to report.

Although this strategy provided a reason- 
able result very early that was consistent with a 
whole-genome shotgun assembly with eight- 
fold coverage, the human genome sequence is 
not as finished as the Drosophila genome was 
with an effective 13-fold coverage. However, it 
became clear that even with this reduced cov- 
erage strategy, Celera could generate an accu- 
rately ordered and oriented scaffold sequence of 
the human genome in less than 1 year. Human 
genome sequencing was initiated 8 September 
1999 and completed 17 June 2000. The first 
assembly was completed 25 June 2000, and the 
assembly reported here was completed 1 Octo- 
ber 2000. Here we describe the whole-genome 
random shotgun sequencing effort applied to 
the human genome. We developed two different
assembly approaches for assembling the ~3
billion bp that make up the 23 pairs of chromo- 
somes of the Homo sapiens genome. Any Gen- 
Bank-derived data were shredded to remove 
potential bias to the final sequence from chi- 
meric clones, foreign DNA contamination, or 
misassembled contigs. Insofar as a correctly 
and accurately assembled genome sequence 
with faithful order and orientation of contigs 
is essential for an accurate analysis of the 
human genetic code, we have devoted a con- 
siderable portion of this manuscript to the 
documentation of the quality of our recon- 
struction of the genome. We also describe our 
preliminary analysis of the human genetic 
code on the basis of computational methods. 
Figure 1 (see fold-out chart associated with 
this issue; files for each chromosome can be 
found in Web fig. 1 on Science Online at 
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1)
provides a graphical overview
of the genome and the features encoded
in it. The detailed manual curation and interpretation
of the genome are just beginning.

To aid the reader in locating specific an- 
alytical sections, we have divided the paper 
into seven broad sections. A summary of the 
major results appears at the beginning of each 
section. 

1 Sources of DNA and Sequencing Methods 

2 Genome Assembly Strategy and 
Characterization 

3 Gene Prediction and Annotation 

4 Genome Structure 

5 Genome Evolution 

6 A Genome-Wide Examination of
Sequence Variations 

7 An Overview of the Predicted Protein-
Coding Genes in the Human Genome

8 Conclusions 



1 Sources of DNA and Sequencing 
Methods

Summary. This section discusses the rationale 
and ethical rules governing donor selection to 
ensure ethnic and gender diversity along with 
the methodologies for DNA extraction and li- 
brary construction. The plasmid library con- 
struction is the first critical step in shotgun 
sequencing. If the DNA libraries are not uni- 
form in size, nonchimeric, and do not randomly 
represent the genome, then the subsequent steps 
cannot accurately reconstruct the genome se- 
quence. We used automated high-throughput 
DNA sequencing and the computational infra- 
structure to enable efficient tracking of enor- 
mous amounts of sequence information (27.3 
million sequence reads; 14.9 billion bp of sequence).
Sequencing and tracking from both
ends of plasmid clones from 2-, 10-, and 50-kbp 
libraries were essential to the computational 
reconstruction of the genome. Our evidence 
indicates that the accurate pairing rate of end 
sequences was greater than 98%. 

Various policies of the United States and the 
World Medical Association, specifically the 
Declaration of Helsinki, offer recommenda- 
tions for conducting experiments with human 
subjects. We convened an Institutional Re- 
view Board (IRB) (31) that helped us estab- 
lish the protocol for obtaining and using hu- 
man DNA and the informed consent process 
used to enroll research volunteers for the 
DNA-sequencing studies reported here. We 
adopted several steps and procedures to pro- 
tect the privacy rights and confidentiality of 
the research subjects (donors). These includ- 
ed a two-stage consent process, a secure ran- 
dom alphanumeric coding system for speci- 
mens and records, circumscribed contact with 
the subjects by researchers, and options for 
off-site contact of donors. In addition, Celera 
applied for and received a Certificate of Con- 
fidentiality from the Department of Health 
and Human Services. This Certificate autho- 
rized Celera to protect the privacy of the 
individuals who volunteered to be donors as 
provided in Section 301(d) of the Public 
Health Service Act 42 U.S.C. 241(d). 

Celera and the IRB believed that the ini- 
tial version of a completed human genome 
should be a composite derived from multiple 
donors of diverse ethnic backgrounds.
Prospective donors were asked, on a voluntary
basis, to self-designate an ethnogeographic 
category (e.g., African-American, Chinese, 
Hispanic, Caucasian, etc.). We enrolled 21 
donors (32).

Three basic items of information from 
each donor were recorded and linked by con- 
fidential code to the donated sample: age, 
sex, and self-designated ethnogeographic 
group. From females, ~130 ml of whole,
heparinized blood was collected. From males,
~130 ml of whole, heparinized blood was
collected, as well as five specimens of semen,
collected over a 6-week period. Permanent 
lymphoblastoid cell lines were created by 
Epstein-Barr virus immortalization. DNA 
from five subjects was selected for genomic 
DNA sequencing: two males and three females
(one African-American, one Asian-Chinese,
one Hispanic-Mexican, and two
Caucasians; see Web fig. 2 on Science Online
at www.sciencemag.org/cgi/content/291/5507/1304/DC1).
The decision of whose DNA to
sequence was based on a complex mix of fac- 
tors, including the goal of achieving diversity as 
well as technical issues such as the quality of 
the DNA libraries and availability of immortal- 
ized cell lines. 

1.1 Library construction and 
sequencing 

Central to the whole-genome shotgun sequenc- 
ing process is preparation of high-quality plas- 
mid libraries in a variety of insert sizes so that 
pairs of sequence reads (mates) are obtained, 
one read from both ends of each plasmid insert.
High-quality libraries have an equal representa- 
tion of all parts of the genome, a small number 
of clones without inserts, and no contamination 
from such sources as the mitochondrial genome 
and Escherichia coli genomic DNA. DNA from
each donor was used to construct plasmid librar- 
ies in one or more of three size classes: 2 kbp, 10 
kbp, and 50 kbp (Table 1) (33). 

In designing the DNA-sequencing pro- 
cess, we focused on developing a simple 
system that could be implemented in a robust 
and reproducible manner and monitored effectively
(Fig. 2) (34).

Current sequencing protocols are based on
the dideoxy sequencing method (35), which
typically yields only 500 to 750 bp of sequence
per reaction. This limitation on read length has
made monumental gains in throughput a pre- 
requisite for the analysis of large eukaryotic 
genomes. We accomplished this at the Celera 
facility, which occupies about 30,000 square 
feet of laboratory space and produces sequence 
data continuously at a rate of 175,000 total 
reads per day. The DNA-sequencing facility is
supported by a high-performance computation- 
al facility (36). 

The process for DNA sequencing was mod- 
ular by design and automated. Intermodule 
sample backlogs allowed four principal 
modules to operate independently: (i) li- 
brary transformation, plating, and colony 
picking; (ii) DNA template preparation; 
(iii) dideoxy sequencing reaction set-up 
and purification; and (iv) sequence deter- 
mination with the ABI PRISM 3700 DNA 
Analyzer. Because the inputs and outputs 
of each module have been carefully
matched and sample backlogs are continu- 
ously managed, sequencing has proceeded 
without a single day's interruption since the 
initiation of the Drosophila project in May 
1999. The ABI 3700 is a fully automated 
capillary-array sequencer and as such can
be operated with a minimal amount of 
hands-on time, currently estimated at about 
15 min per day. The capillary system also 
facilitates correct associations of sequenc- 
ing traces with samples through the elimi- 
nation of manual sample loading and lane- 
tracking errors associated with slab gels. 
About 65 production staff were hired and 
trained, and were rotated on a regular basis
through the four production modules. A
central laboratory information management 
system (LIMS) tracked all sample plates by 
unique bar code identifiers. The facility was 
supported by a quality control team that per- 
formed raw material and in-process testing 
and a quality assurance group with responsi- 
bilities including document control, valida- 
tion, and auditing of the facility. Critical to 
the success of the scale-up was the validation 
of all software and instrumentation before 
implementation, and production-scale testing 
of any process changes. 

1.2 Trace processing 

An automated trace-processing pipeline has 
been developed to process each sequence file 
(37). After quality and vector trimming, the 
average trimmed sequence length was 543 
bp, and the sequencing accuracy was expo- 
nentially distributed with a mean of 99.5% 
and with less than 1 in 1000 reads being less 
than 98% accurate (26). Each trimmed se- 
quence was screened for matches to contam- 
inants including sequences of vector alone, E. 
coli genomic DNA, and human mitochondri- 
al DNA. The entire read for any sequence 
with a significant match to a contaminant was 
discarded. A total of 713 reads matched E. 
coli genomic DNA and 2114 reads matched 
the human mitochondrial genome. 
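
The screening step can be pictured as a filter over trimmed reads. The sketch below is a toy stand-in, not the Celera pipeline: it flags a read when it shares any exact k-mer with a contaminant sequence, whereas the text says only that a read with a significant match to a contaminant was discarded in full. The function names, the value of k, and the example sequences are all invented for illustration, and reverse-complement matches are ignored.

```python
def kmers(seq: str, k: int) -> set:
    """All substrings of length k in seq (empty set if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_reads(reads, contaminants, k=12):
    """Split reads into (kept, discarded) by exact shared k-mers.

    A read is discarded whole if it shares at least one k-mer with any
    contaminant sequence; this exact-match rule is a simplification.
    """
    contaminant_kmers = set()
    for contaminant in contaminants:
        contaminant_kmers |= kmers(contaminant.lower(), k)

    kept, discarded = [], []
    for read in reads:
        if kmers(read.lower(), k) & contaminant_kmers:
            discarded.append(read)   # drop the entire read, not just the match
        else:
            kept.append(read)
    return kept, discarded


# Made-up sequences: one "vector" and two reads, screened with a short k.
vector = "acgtacgtggcctta"
reads = ["ttttacgtacgtggaaaa", "ggggggggcccccccc"]
kept, discarded = screen_reads(reads, [vector], k=8)
print("kept:", kept)
print("discarded:", discarded)
```

In this example the first read contains the 8-mer acgtacgt from the made-up vector and is discarded, while the second read shares no 8-mer with it and is kept. A production screen would use an alignment tool with a score threshold instead of exact k-mers.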

1.3 Quality assessment and control 

The importance of the base-pair level accuracy
of the sequence data increases as the
size and repetitive nature of the genome to
be sequenced increases. Each sequence
read must be placed uniquely in the genome,
and even a modest error rate can
reduce the effectiveness of assembly. In
addition, maintaining the validity of mate-pair
information is absolutely critical for
the algorithms described below. Procedural
controls were established for maintaining
the validity of sequence mate-pairs as sequencing
reactions proceeded through the
process, including strict rules built into the
LIMS. The accuracy of sequence data produced
by the Celera process was validated
in the course of the Drosophila genome
project (26). By collecting data for the
entire human genome in a single facility,
we were able to ensure uniform quality
standards and the cost advantages associated
with automation, an economy of scale,
and process consistency.

Table 1. Celera-generated data input into assembly.

2 Genome Assembly Strategy and 
Characterization 

Summary. We describe in this section the two
approaches that we used to assemble the genome.
One method involves the computational
combination of all sequence reads with shredded
data from GenBank to generate an independent,
nonbiased view of the genome. The second
approach involves clustering all of the fragments
to a region or chromosome on the basis
of mapping information. The clustered data
were then shredded and subjected to computational
assembly. Both approaches provided essentially
the same reconstruction of assembled
DNA sequence with proper order and orientation.
The second method provided slightly
greater sequence coverage (fewer gaps) and
was the principal sequence used for the analysis
phase. In addition, we document the completeness
and correctness of this assembly process
and provide a comparison to the public genome
sequence, which was reconstructed largely by
an independent BAC-by-BAC approach. Our
assemblies effectively covered the euchromatic
regions of the human chromosomes. More than
90% of the genome was in scaffold assemblies
of 100,000 bp or greater, and 25% of the genome
was in scaffolds of 10 million bp or
larger.

Fig. 2. Flow diagram for sequencing pipeline. Samples are received,
selected, and processed in compliance with standard operating procedures,
with a focus on quality within and across departments. Each
process has defined inputs and outputs with the capability to exchange
samples and data with both internal and external entities according to
defined quality guidelines. Manufacturing pipeline processes, products,
quality control measures, and responsible parties are indicated and are
described further in the text.

Shotgun sequence assembly is a classic 
example of an inverse problem: given a set 
of reads randomly sampled from a target 
sequence, reconstruct the order and the po- 
sition of those reads in the target. Genome 
assembly algorithms developed for Dro- 
sophila have now been extended to assemble 
the ~25-fold larger human genome. Celera assemblies
consist of a set of contigs that are
ordered and oriented into scaffolds that are then 
mapped to chromosomal locations by using 
known markers. The contigs consist of a col- 
lection of overlapping sequence reads that pro- 
vide a consensus reconstruction for a contigu- 
ous interval of the genome. Mate pairs are a 
central component of the assembly strategy. 
They are used to produce scaffolds in which the 
size of gaps between consecutive contigs is 
known with reasonable precision. This is accomplished
by observing that a pair of reads,
one of which is in one contig, and the other of 
which is in another, implies an orientation and 
distance between the two contigs (Fig. 3). Fi- 
nally, our assemblies did not incorporate all 
reads into the final set of reported scaffolds. 
This set of unincorporated reads is termed 
"chaff" and typically consisted of reads from 
within highly repetitive regions, data from other 
organisms introduced through various routes as 
found in many genome projects, and data of 
poor quality or with untrimmed vector. 
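
The way a mate pair constrains two contigs can be illustrated with a small calculation. This is a minimal sketch under simplifying assumptions: both contigs are already in the same orientation, coordinates are 0-based, and the insert length of the clone library is treated as exactly known. The class and function names are hypothetical and are not taken from the Celera assembler.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MateLink:
    """One mate pair bridging contig A (left) and contig B (right)."""
    insert_bp: int   # insert length of the clone library
    a_length: int    # length of contig A
    a_start: int     # position in A where the forward read begins
    b_end: int       # position in B where the clone (reverse read) ends


def implied_gap(link: MateLink) -> int:
    """Gap between A and B: insert length minus the parts inside each contig."""
    inside_a = link.a_length - link.a_start
    inside_b = link.b_end
    return link.insert_bp - inside_a - inside_b


def estimate_gap(links: list[MateLink]) -> float:
    """Average the gap implied by several independent mate pairs."""
    return mean(implied_gap(link) for link in links)


# Illustrative numbers only: two 10-kbp clones and one 50-kbp clone.
links = [
    MateLink(insert_bp=10_000, a_length=5_000, a_start=4_000, b_end=1_500),
    MateLink(insert_bp=10_000, a_length=5_000, a_start=2_500, b_end=200),
    MateLink(insert_bp=50_000, a_length=5_000, a_start=1_000, b_end=38_300),
]
print([implied_gap(link) for link in links])
print(estimate_gap(links))
```

A negative result from implied_gap would suggest that the two contigs overlap. Because real insert lengths vary around the library mean, combining several links, as estimate_gap does, gives a steadier estimate than any single pair.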
2.1 Assembly data sets

We used two independent sets of data for our 
assemblies. The first was a random shotgun 
data set of 27.27 million reads of average length 
543 bp produced at Celera. This consisted 
largely of mate-pair reads from 16 libraries 
constructed from DNA samples taken from five 
different donors. Libraries with insert sizes of 2, 
10, and 50 kbp were used. By looking at how
mate pairs from a library were positioned in 
known sequenced stretches of the genome, we 
were able to characterize the range of insert 
sizes in each library and determine a mean and 
standard deviation. Table 1 details the number 
of reads, sequencing coverage, and clone coverage
achieved by the data set. The clone coverage
is the coverage of the genome in cloned
DNA, considering the entire insert of each 
clone that has sequence from both ends. The 
clone coverage provides a measure of the 
amount of physical DNA coverage of the ge- 
nome. Assuming a genome size of 2.9 Gbp, the 
Celera trimmed sequences gave a 5.1X coverage
of the genome, and clone coverage was
3.42X, 16.40X, and 18.84X for the 2-, 10-, and
50-kbp libraries, respectively, for a total of
38.7X clone coverage.
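
The coverage figures in this paragraph can be reproduced with simple arithmetic. The sketch below uses only numbers quoted in the text; the function name is an assumption introduced for illustration.

```python
def sequence_coverage(n_reads, mean_read_len_bp, genome_size_bp):
    """Sequence coverage = total sequenced bases / genome size."""
    return n_reads * mean_read_len_bp / genome_size_bp


# 27.27 million reads of average length 543 bp over an assumed 2.9-Gbp genome.
seq_cov = sequence_coverage(27.27e6, 543, 2.9e9)

# Clone coverage quoted for the 2-, 10-, and 50-kbp libraries.
clone_cov = 3.42 + 16.40 + 18.84

print(f"sequence coverage ~ {seq_cov:.1f}X, clone coverage ~ {clone_cov:.1f}X")
```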

The second data set was from the publicly 
funded Human Genome Project (PFP) and is 
primarily derived from BAC clones (30). The 
BAC data input to the assemblies came from a
download of GenBank on 1 September 2000 
(Table 2) totaling 4443.3 Mbp of sequence. 
The data for each BAC is deposited at one of 
four levels of completion. Phase 0 data are a set
of generally unassembled sequencing reads
from a very light shotgun of the BAC, typically
less than 1X. Phase 1 data are unordered assemblies
of contigs, which we call BAC contigs
or bactigs. Phase 2 data are ordered assemblies 
of bactigs. Phase 3 data are complete BAC 
sequences. In the past 2 years the PFP has
focused on a product of lower quality and com- 
pleteness, but on a faster time-course, by con- 
centrating on the production of Phase 1 data 
from a 3X to 4X light-shotgun of each BAC 
clone. 

We screened the bactig sequences for contaminants
by using the BLAST algorithm
against three data sets: (i) vector sequences 
in Univec core (38), filtered for a 25-bp 
match at 98% sequence identity at the ends 
of the sequence and a 30-bp match internal 
to the sequence; (ii) the nonhuman portion
of the High Throughput Genomic (HTG)
Sequences division of GenBank (39), filtered
at 200 bp at 98%; and (iii) the nonredundant
nucleotide sequences from GenBank
without primate and human virus entries,
filtered at 200 bp at 98%. Whenever
25 bp or more of vector was found within 
50 bp of the end of a contig, the tip up to 
the matching vector was excised. Under 
these criteria we removed 2.6 Mbp of pos- 
sible contaminant and vector from the 
Phase 3 data, 61.0 Mbp from the Phase 1 
and 2 data, and 16.1 Mbp from the Phase 0 
data (Table 2). This left us with a total of 
4363.7 Mbp of PFP sequence data: 20%
finished, 75% rough-draft (Phase 1 and 2), 
and 5% single sequencing reads (Phase 0). 
An additional 104,018 BAC end-sequence 
mate pairs were also downloaded and in- 
cluded in the data sets for both assembly 
processes (18). 
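
The tip-excision rule stated above (25 bp or more of vector within 50 bp of a contig end) can be sketched as below. This is one possible reading of the rule, not the screening code actually used: it assumes a single vector match given as a half-open interval, treats "within 50 bp of the end" as the match lying less than 50 bp from that end, and removes the tip together with the matching vector. All names are hypothetical.

```python
def trim_vector_tip(contig_len, match_start, match_end, min_match=25, end_window=50):
    """Return the (start, end) interval of the contig that is kept.

    match_start, match_end: half-open coordinates of a vector match on the contig.
    The thresholds follow the text; the interval convention is an assumption.
    """
    if match_end - match_start < min_match:
        return (0, contig_len)          # match too short to trigger trimming
    if match_start < end_window:
        return (match_end, contig_len)  # vector near the start: drop the leading tip
    if contig_len - match_end < end_window:
        return (0, match_start)         # vector near the end: drop the trailing tip
    return (0, contig_len)              # internal match: handled by other filters


print(trim_vector_tip(1000, 10, 40))    # leading tip removed
print(trim_vector_tip(1000, 960, 990))  # trailing tip removed
```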

2.2 Assembly strategies 

Two different approaches to assembly were 
pursued. The first was a whole-genome assembly
process that used Celera data and the
PFP data in the form of additional synthetic
shotgun data, and the second was a compartmentalized
assembly process that first partitioned
the Celera and PFP data into sets
localized to large chromosomal segments and 
then performed ab initio shotgun assembly on 
each set. Figure 4 gives a schematic of the 
overall process flow. 

For the whole-genome assembly, the PFP 
data was first disassembled or "shredded" into a 
synthetic shotgun data set of 550-bp reads that 
form a perfect 2X covering of the bactigs. This 
resulted in 16.05 million "faux" reads that were
sufficient to cover the genome 2.96X because
of redundancy in the BAC data set, without 
incorporating the biases inherent in the PFP 
assembly process. The combined data set of 
43.32 million reads (8X), and all associated 
mate-pair information, were then subjected to 
our whole-genome assembly algorithm to pro- 
duce a reconstruction of the genome. Neither 
the location of a BAC in the genome nor its 
assembly of bactigs was used in this process. 
Bactigs were shredded into reads because we 
found strong evidence that 2.13% of them were 
misassembled (40). Furthermore, BAC location 
information was ignored because some BACs
were not correctly placed on the PFP physical
map and because we found strong evidence that
at least 2.2% of the BACs contained sequence
data that were not part of the given BAC (41),
possibly as a result of sample-tracking errors
(see below). In short, we performed a true, ab
initio whole-genome assembly in which we
took the expedient of deriving additional sequence
coverage, but not mate pairs, assembled
bactigs, or genome locality, from some externally
generated data.

Table 2. GenBank data input into assembly.

Center                    Statistic                               Phase 0   Phase 1 and 2         Phase 3
[center name illegible]   Total contaminant masked (bp)                 0         308,426          27,781
                          Average contig length (bp)                    0           7,093          66,378
Sanger Centre, UK         Number of accession records                   0           4,538           2,599
                          Number of contigs                             0          74,324           2,599
                          Total base pairs                              0     689,059,692     246,118,000
                          Total vector masked (bp)                      0         427,326          25,054
                          Total contaminant masked (bp)                 0     [illegible]     [illegible]
                          Average contig length (bp)                    0           9,271          94,697
Others*                   Number of accession records                  42           1,894           3,458
                          Number of contigs                         5,978          29,898           3,458
                          Total base pairs                      5,564,879     283,358,877     246,474,157
                          Total vector masked (bp)                 57,448         279,477          32,136
                          Total contaminant masked (bp)           575,366       1,616,665       1,791,849
                          Average contig length (bp)                  931           9,478          71,277
All centers combined†     Number of accession records               3,021          21,015           9,137
                          Number of contigs                       258,943         409,628           9,137
                          Total base pairs                    209,930,983   3,360,047,574     835,722,268
                          Total vector masked (bp)              1,655,293       2,438,575          82,284
                          Total contaminant masked (bp)        14,918,135      16,311,664       3,365,230
                          Average contig length (bp)                  811           8,203          91,466

*Other centers contributing at least 0.1% of the sequence include: Chinese National Human Genome Center;
Genomanalyse Gesellschaft fuer Biotechnologische Forschung mbH; Genome Therapeutics Corporation; GENOSCOPE;
Chinese Academy of Sciences; Institute of Molecular Biotechnology; Keio University School of Medicine; Lawrence
Livermore National Laboratory; Cold Spring Harbor Laboratory; Los Alamos National Laboratory; Max-Planck Institut fuer
Molekulare Genetik; Japan Science and Technology Corporation; Stanford University; The Institute for Genomic
Research; The Institute of Physical and Chemical Research, Gene Bank; The University of Oklahoma; University of Texas
Southwestern Medical Center; University of Washington. †The 4,405,700,825 bases contributed by all centers were
shredded into faux reads resulting in 2.96X coverage of the genome.

In the compartmentalized shotgun assembly 
(CSA), Celera and PFP data were partitioned 
into the largest possible chromosomal segments 
or "components" that could be determined with 
confidence, and then shotgun assembly was ap- 
plied to each partitioned subset wherein the 
bactig data were again shredded into faux reads 
to ensure an independent ab initio assembly of 
the component. By subsetting the data in this
way, the overall computational effort was reduced
and the effect of interchromosomal duplications
was ameliorated. This also resulted in a
reconstruction of the genome that was relatively 
independent of the whole-genome assembly results
so that the two assemblies could be compared
for consistency. The quality of the partitioning
into components was crucial so that
different genome regions were not mixed to- 
gether. We constructed components from (i) the 
longest scaffolds of the sequence from each 
BAC and (ii) assembled scaffolds of data unique 
to Celera's data set. The BAC assemblies were
obtained by a combining assembler that used the
bactigs and the 5X Celera data mapped to those
bactigs as input. This effort was undertaken as
an interim step solely because the more accurate 
and complete the scaffold for a given sequence 
stretch, the more accurately one can tile these 
scaffolds into contiguous components on the 
basis of sequence overlap and mate-pair infor- 
mation. We further visually inspected and cu- 
rated the scaffold tiling of the components to 
further increase its accuracy. For the final CSA 
assembly, all but the partitioning was ignored, 
and an independent, ab initio reconstruction of 
the sequence in each component was obtained 
by applying our whole-genome assembly algo- 
rithm to the partitioned, relevant Celera data and 
the shredded, faux reads of the partitioned, rel- 
evant bactig data. 

2.3 Whole-genome assembly 

The algorithms used for whole-genome assembly
(WGA) of the human genome were
enhancements to those used to produce the 
sequence of the Drosophila genome reported 
in detail in (28). 

The WGA assembler consists of a pipeline 
composed of five principal stages: Screener, 
Overlapper, Unitigger, Scaffolder, and Repeat 
Resolver, respectively. The Screener finds 
and marks all microsatellite repeats with less 
than a 6-bp element, and screens out all 
known interspersed repeat elements, includ- 
ing Alu, Line, and ribosomal DNA. Marked 
regions get searched for overlaps, whereas 
screened regions do not get searched, but can 
be part of an overlap that involves unscreened 
matching segments. 

The Overlapper compares every read
against every other read in search of complete 
end-to-end overlaps of at least 40 bp and with 
no more than 6% differences in the match. 
Because all data are scrupulously vector- 
trimmed, the Overlapper can insist on com- 
plete overlap matches. Computing the set of 
all overlaps took roughly 10,000 CPU hours
with a suite of four-processor Alpha SMPs
with 4 gigabytes of RAM. This took 4 to 5 
days in elapsed time with 40 such machines 
operating in parallel. 
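
The overlap criterion (at least 40 bp, no more than 6% differences) can be illustrated with a deliberately simplified check. The sketch below is an assumption-laden stand-in for the Overlapper: it only counts substitutions between a suffix of one read and a prefix of another, whereas a real overlapper would also tolerate insertions and deletions and would not use this quadratic scan. The function name and example reads are hypothetical.

```python
def best_overlap(a, b, min_len=40, max_diff=0.06):
    """Length of the longest suffix of `a` matching a prefix of `b`
    with at most `max_diff` mismatching positions, or 0 if none qualifies."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-n:], b[:n]))
        if mismatches <= max_diff * n:
            return n
    return 0


# A made-up 50-bp segment shared by the end of read_a and the start of read_b.
shared = "ACGTTGCAAGCTTAGGCTAACCGGTTAGCATCGATCGGATTACAGGCTTA"
read_a = "A" * 80 + shared
read_b = shared + "C" * 80
print(best_overlap(read_a, read_b))
```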

Every overlap computed above is statistically
a 1-in-10^17 event and thus not a coincidental
event. What makes assembly combinatorially
difficult is that while many overlaps
are actually sampled from overlapping
regions of the genome, and thus imply that 
the sequence reads should be assembled to- 
gether, even more overlaps are actually from 
two distinct copies of a low-copy repeated 
element not screened above, thus constituting 
an error if put together. We call the former
"true overlaps" and the latter "repeat-induced
overlaps." The assembler must avoid choos- 
ing repeat-induced overlaps, especially early 
in the process. 

We achieve this objective in the Unitig- 
ger. We first find all assemblies of reads that 
appear to be uncontested with respect to all 
other reads. We call the contigs formed from 
these subassemblies unitigs (for uniquely as- 
sembled contigs). Formally, these unitigs are 
the uncontested interval subgraphs of the 
graph of all overlaps (42). Unfortunately, al- 
though empirically many of these assemblies 
are correct (and thus involve only true over- 
laps), some are in fact collections of reads 
from several copies of a repetitive element 
that have been overcollapsed into a single
subassembly. However, the overcollapsed
unitigs are easily identified because their average
coverage depth is too high to be consistent
with the overall level of sequence
coverage. We developed a simple statistical
discriminator that gives the logarithm of the
odds ratio that a unitig is composed of unique
DNA or of a repeat consisting of two or more 
copies. The discriminator, set to a sufficiently 
stringent threshold, identifies a subset of the 
unitigs that we are certain are correct. In 
addition, a second, less stringent threshold 
identifies a subset of remaining unitigs very 
likely to be correctly assembled, of which we 
select those that will consistently scaffold 
(see below), and thus are again almost certain 
to be correct. We call the union of these two 
sets U-unitigs. Empirically, we found from a 
6X simulated shotgun of human chromosome 
22 that we get U-unitigs covering 98% of the
stretches of unique DNA that are >2 kbp 
long. We are further able to identify the 
boundary of the start of a repetitive element 
at the ends of a U-unitig and leverage this so 
that U-unitigs span more than 93% of all 
singly interspersed Alu elements and other
100- to 400-bp repetitive segments.
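
The text does not give the formula for the discriminator, so the following is only one plausible form of such a log-odds score. It assumes that read start positions arrive as a Poisson process and compares a single-copy hypothesis with a two-copy hypothesis; the function name and the example unitig are hypothetical.

```python
import math


def unique_log_odds(unitig_len_bp, n_reads, total_reads, genome_size_bp):
    """Natural-log odds that a unitig is unique DNA rather than a collapsed
    two-copy repeat, under a Poisson model of read arrivals (an assumption).

    With lam = expected reads if unique, ln[Pois(k; lam) / Pois(k; 2*lam)]
    simplifies to lam - k * ln(2).
    """
    lam = unitig_len_bp * total_reads / genome_size_bp
    return lam - n_reads * math.log(2)


# A 10-kbp unitig drawn from 27.27 million reads over a 2.9-Gbp genome.
print(unique_log_odds(10_000, 94, 27.27e6, 2.9e9))   # near-expected depth: positive
print(unique_log_odds(10_000, 188, 27.27e6, 2.9e9))  # doubled depth: negative
```

A stringent positive threshold on such a score would correspond to the "certain" subset described above, and a lower threshold to the "very likely" subset.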

The result of running the Unitigger was 
thus a set of correctly assembled subcontigs 
covering an estimated 73.6% of the human 
genome. The Scaffolder then proceeded to
use mate-pair information to link these to- 
gether into scaffolds. When there are two or 
more mate pairs that imply that a given pair 
of U-unitigs are at a certain distance and 
orientation with respect to each other, the 
probability of this being wrong is again 
roughly 1 in 10^10, assuming that mate pairs
are false less than 2% of the time. Thus, one 
can with high confidence link together all 
U-unitigs that are linked by at least two 2- or 
10-kbp mate pairs producing intermediate- 
sized scaffolds that are then recursively 
linked together by confirming 50-kbp mate
pairs and BAC end sequences. This process 
yielded scaffolds that are on the order of 
megabase pairs in size with gaps between 
their contigs that generally correspond to re- 
petitive elements and occasionally to small 
sequencing gaps. These scaffolds reconstruct
the majority of the unique sequence within a 
genome. 

For the Drosophila assembly, we engaged 
in a three-stage repeat resolution strategy 
where each stage was progressively more 
aggressive and thus more likely to make a
mistake. For the human assembly, we contin- 
ued to use the first "Rocks" substage where 
all unitigs with a good, but not definitive, 
discriminator score are placed in a scaffold 
gap. This was done with the condition that 
two or more mate pairs with one of their 
reads already in the scaffold unambiguously 
place the unitig in the given gap. We estimate 
the probability of inserting a unitig into an 
incorrect gap with this strategy to be less than
10^-7 based on a probabilistic analysis.

We revised the ensuing "Stones" substage 
of the human assembly, making it more like 
the mechanism suggested in our earlier work 
(43). For each gap, every read R that is placed
in the gap by virtue of its mated pair M being 
in a contig of the scaffold and implying R's 
placement is collected. Celera's mate-pairing 
information is correct more than 99% of the 
time. Thus, almost every, but not all, of the 
reads in the set belong in the gap, and when 
a read does not belong it rarely agrees with 
the remainder of the reads. Therefore, we 
simply assemble this set of reads within the 
gap, eliminating any reads that conflict with 
the assembly. This operation proved much 
more reliable than the one it replaced for the 
Drosophila assembly; in the assembly of a
simulated shotgun data set of human chromosome
22, all stones were placed correctly.

Fig. 4. Architecture of Celera's two-pronged assembly strategy. Each oval denotes a computation
process performing the function indicated by its label, with the labels on arcs between ovals
describing the nature of the objects produced and/or consumed by a process. This figure
summarizes the discussion in the text that defines the terms and phrases used.

The final method of resolving gaps is to 
fill them with assembled BAC data that cover 
the gap. We call this external gap "walking." 
We did not include the very aggressive "Peb- 
bles" substage described in our Drosophila 
work, which made enough mistakes so as to 
produce repeat reconstructions for long inter- 
spersed elements whose quality was only 
99.62% correct. We decided that for the hu- 
man genome it was philosophically better not 
to introduce a step that was certain to produce
less than 99.99% accuracy. The cost was a
somewhat larger number of gaps of some- 
what larger size. 

At the final stage of the assembly process, 
and also at several intermediate points, a
consensus sequence of every contig is produced.
Our algorithm is driven by the principle
of maximum parsimony, with quality-value-weighted
measures for evaluating each
base. The net effect is a Bayesian estimate of 
the correct base to report at each position. 
Consensus generation uses Celera data when- 
ever it is present. In the event that no Celera 
data cover a given region, the BAC data 
sequence is used. 
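
The idea of a quality-value-weighted, Bayesian base call can be illustrated for a single alignment column. The sketch below is not the consensus algorithm described above (which also involves maximum parsimony across the alignment); it assumes Phred-style quality values greater than zero, independent read errors, errors spread evenly over the three other bases, and a uniform prior. The function name is hypothetical.

```python
import math


def call_base(column):
    """column: list of (base, quality) pairs covering one position.
    Returns (most probable base, its posterior probability)."""
    log_like = {}
    for truth in "ACGT":
        total = 0.0
        for base, q in column:
            err = 10 ** (-q / 10)  # Phred-style error probability (assumed scale)
            total += math.log(1 - err) if base == truth else math.log(err / 3)
        log_like[truth] = total
    top = max(log_like.values())
    weights = {b: math.exp(v - top) for b, v in log_like.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


# Two low-quality A calls are outweighed by one high-quality C call.
print(call_base([("A", 5), ("A", 5), ("C", 40)]))
```

The example shows why weighting by quality differs from a simple majority vote.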

A key element of achieving a WGA of the 
human genome was to parallelize the Overlapper
and the central consensus sequence-constructing
subroutines. In addition, memory was
a real issue — a straightforward application of 
the software we had built for Drosophila would 
have required a computer with a 600-gigabyte
RAM. By making the Overlapper and Unitigger 
incremental, we were able to achieve the same 
computation with a maximum of instantaneous 
usage of 28 gigabytes of RAM. Moreover, the 
incremental nature of the first three stages al- 
lowed us to continually update the state of this 
part of the computation as data were delivered 
and then perform a 7-day run to complete Scaf- 
folding and Repeat Resolution whenever de- 
sired. For our assembly operations, the total 
compute infrastructure consists of 10 four-processor
SMPs with 4 gigabytes of memory per
cluster (Compaq's ES40, Regatta) and a 16- 
processor NUMA machine with 64 gigabytes 
of memory (Compaq's GS160, Wildfire). The 
total compute for a run of the assembler was 
roughly 20,000 CPU hours. 

The assembly of Celera's data, together 
with the shredded bactig data, produced a set of 
scaffolds totaling 2.848 Gbp in span and con- 
sisting of 2.586 Gbp of sequence. The chaff, or 
set of reads not incorporated in the assembly, 
numbered 11.27 million (26%), which is consistent
with our experience for Drosophila.
More than 84% of the genome was covered by
scaffolds >100 kbp long, and these averaged 
91% sequence and 9% gaps with a total of 
2.297 Gbp of sequence. There were a total of
93,857 gaps among the 1637 scaffolds >100 
kbp. The average scaffold size was 1.5 Mbp,
the average contig size was 24.06 kbp, and the
average gap size was 2.43 kbp, where the distribution
of each was essentially exponential.
More than 50% of all gaps were less than 500 
bp long, >62% of all gaps were less than 1 kbp 
long, and no gap was >100 kbp long. Similar- 
ly, more than 65% of the sequence is in contigs 
>30 kbp, more than 31% is in contigs >100 
kbp, and the largest contig was 1.22 Mbp long.
Table 3 gives detailed summary statistics for 
the structure of this assembly with a direct 
comparison to the compartmentalized shotgun 
assembly. 

2.4 Compartmentalized shotgun 
assembly 

In addition to the WGA approach, we pur- 
sued a localized assembly approach that was 
intended to subdivide the genome into seg- 
ments, each of which could be shotgun as- 
sembled individually. We expected that this 
would help in resolution of large interchro- 
mosomal duplications and improve the statis- 
tics for calculating U-unitigs. The compart- 
mentalized assembly process involved clus- 
tering Celera reads and bactigs into large, 
multiple megabase regions of the genome, 
and then running the WGA assembler on the 
Celera data and shredded, faux reads obtained
from the bactig data.

Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

                                                                        Scaffold size
                                                All        >30 kbp       >100 kbp       >500 kbp      >1000 kbp

Compartmentalized shotgun assembly
No. of bp in scaffolds
  (including intrascaffold gaps)      2,905,568,203  2,748,892,430  2,700,489,906  2,489,357,260  2,248,689,128
No. of bp in contigs                  2,653,979,733  2,524,251,302  2,491,538,372  2,320,648,201  2,106,521,902
No. of scaffolds                             53,591          2,845          1,935          1,060            721
No. of contigs                              170,033        112,207        107,199         93,138         82,009
No. of gaps                                 116,442        109,362        105,264         92,078         81,288
No. of gaps ≤1 kbp                           72,091         69,175         67,289         59,915         53,354
Average scaffold size (bp)                   54,217        966,219      1,395,602      2,348,450      3,118,848
Average contig size (bp)                     15,609         22,496         23,242         24,916         25,686
Average intrascaffold gap size (bp)           2,161          2,054          1,985          1,832          1,749
Largest contig (bp)                       1,988,321      1,988,321      1,988,321      1,988,321      1,988,321
% of total contigs                              100      [missing]             94             87             79

Whole-genome assembly
No. of bp in scaffolds
  (including intrascaffold gaps)      2,847,890,390  2,574,792,618  2,525,334,447  2,328,535,466  2,140,943,032
No. of bp in contigs                  2,586,634,108  2,334,343,339  2,297,678,935  2,143,002,184  1,983,305,432
No. of scaffolds                            118,968          2,507          1,637            818            554
No. of contigs                              221,036         99,189         95,494         84,641         76,285
No. of gaps                                 102,068         96,682         93,857         83,823         75,731
No. of gaps ≤1 kbp                           62,356         60,343         59,156         54,079         49,592
Average scaffold size (bp)                   23,938      1,027,041      1,542,660      2,846,620      3,864,518
Average contig size (bp)                     11,702         23,534         24,061         25,319         25,999
Average intrascaffold gap size (bp)           2,560          2,487          2,426          2,213          2,082
Largest contig (bp)                       1,224,073      1,224,073      1,224,073      1,224,073      1,224,073
% of total contigs                              100             90             89             83             77

The first phase of the CSA strategy was to
separate Celera reads into those that matched
the BAC contigs for a particular PFP BAC
entry, and those that did not match any public
data. Such matches must be guaranteed to
properly place a Celera read, so all reads were
first masked against a library of common
repetitive elements, and only matches of at
least 40 bp to unmasked portions of the read
constituted a hit. Of Celera's 27.27 million
reads, 20.76 million matched a bactig and
another 0.62 million reads, which did not
have any matches, were nonetheless identified
as belonging in the region of the bactig's
BAC because their mate matched the bactig.
Of the remaining reads, 2.92 million were
completely screened out and so could not be
matched, but the other 2.97 million reads had
unmasked sequence totaling 1.189 Gbp that
were not found in the GenBank data set.
Because the Celera data are 5.11X redundant,
we estimate that 240 Mbp of unique Celera
sequence is not in the GenBank data set.
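
As a quick consistency check on the read counts quoted for the first CSA phase, the four categories should add up to the full Celera data set. The snippet below uses only figures from the text; the variable names are illustrative.

```python
# Read counts in millions, as quoted in the text.
matched_bactig = 20.76
placed_by_mate = 0.62
fully_screened = 2.92
unmatched_with_sequence = 2.97

total = matched_bactig + placed_by_mate + fully_screened + unmatched_with_sequence
not_matching_genbank = fully_screened + unmatched_with_sequence

print(f"total = {total:.2f} million, not matching GenBank = {not_matching_genbank:.2f} million")
```

The second sum corresponds to the Celera fragments not matching the GenBank data that are assembled separately later in the CSA process.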

In the next step of the CSA process, a 
combining assembler took the relevant 5X 
Celera reads and bactigs for a BAC entry, and 
produced an assembly of the combined data 
for that locale. These high-quality sequence 
reconstructions were a transient result whose 
utility was simply to provide more reliable 
information for the purposes of their tiling 
into sets of overlapping and adjacent scaffold 
sequences in the next step. In outline, the 
combining assembler first examines the set of 
matching Celera reads to determine if there 
are excessive pileups indicative of un- 
screened repetitive elements. Wherever these 
occur, reads in the repeat region whose mates 
have not been mapped to consistent positions 
are removed. Then all sets of mate pairs that 
consistently imply the same relative position 
of two bactigs are bundled into a link and 
weighted according to the number of mates in 
the bundle. A "greedy" strategy then attempts 
to order the bactigs by selecting bundles of 
mate-pairs in order of their weight. A selected 
mate-pair bundle can tie together two forma- 
tive scaffolds. It is incorporated to form a 
single scaffold only if it is consistent with the 
majority of links between contigs of the scaffold.
Once scaffolding is complete, gaps are
filled by the "Stones" strategy described
above for the WGA assembler. 
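
The bundling and greedy ordering described above can be sketched roughly as follows. This is a simplified illustration rather than the combining assembler itself: it bundles mate pairs only by which two bactigs they join (ignoring relative position and orientation), and the majority-consistency test is reduced to a caller-supplied placeholder. All names, including `is_consistent`, are hypothetical.

```python
from collections import Counter


def greedy_scaffold(mate_links, is_consistent=lambda a, b: True):
    """mate_links: iterable of (bactig_a, bactig_b), one entry per mate pair.
    Returns a dict mapping each bactig to a representative of its scaffold."""
    # Bundle mate pairs joining the same two bactigs; the weight is the pair count.
    bundles = Counter(tuple(sorted(link)) for link in mate_links)
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Greedy step: consider the heaviest bundles first.
    for (a, b), _weight in sorted(bundles.items(), key=lambda kv: -kv[1]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b and is_consistent(a, b):
            parent[root_a] = root_b  # merge the two formative scaffolds
    return {x: find(x) for x in parent}


links = [("b1", "b2")] * 3 + [("b2", "b3")] * 2 + [("b4", "b5")]
print(greedy_scaffold(links))
```

A fuller version would also record the order, orientation, and gap estimate implied by each accepted bundle.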

The GenBank data for the Phase 1 and 2 
BACs consisted of an average of 19.8 bactigs 
per BAC of average size 8099 bp. Applica- 
tion of the combining assembler resulted in 
individual Celera BAC assemblies being put 
together into an average of 1.83 scaffolds 
(median of 1 scaffold) consisting of an aver- 
age of 8.57 contigs of average size 18,973 bp. 
In addition to defining order and orientation 
of the sequence fragments, there were 57% 
fewer gaps in the combined result. For Phase 
0 data, the average GenBank entry consisted 
of 91.52 reads of average length 784 bp. 
Application of the combining assembler re- 
sulted in an average of 54.8 scaffolds consist- 
ing of an average of 58.1 contigs of average 
size 873 bp. Basically, some small amount of
assembly took place, but not enough Celera
data were matched to truly assemble the 0.5X
to 1X data set represented by the typical
Phase 0 BACs. The combining assembler
was also applied to the Phase 3 BACs for
SNP identification, confirmation of assembly,
and localization of the Celera reads. The
Phase 0 data suggest that a combined whole-genome
shotgun data set and 1X light-shotgun
of BACs will not yield good assembly of
BAC regions; at least 3X light-shotgun of
each BAC is needed.

The 5.89 million Celera fragments not 
matching the GenBank data were assembled 
with our whole-genome assembler. The as- 
sembly resulted in a set of scaffolds totaling 
442 Mbp in span and consisting of 326 Mbp 
of sequence. More than 20% of the scaffolds 
were >5 kbp long, and these averaged 63% 
sequence and 27% gaps with a total of 302 
Mbp of sequence. All scaffolds >5 kbp were 
forwarded along with all scaffolds produced 
by the combining assembler to the subse- 
quent tiling phase. 

At this stage, we typically had one or two 
scaffolds for every BAC region constituting
at least 95% of the relevant sequence, and a 
collection of disjoint Celera-unique scaffolds. 
The next step in developing the genome com- 
ponents was to determine the order and over- 
lap tiling of these BAC and Celera-unique 
scaffolds across the genome. For this, we 
used Celera's 50-kbp mate-pairs information, 
and BAC-end pairs (18) and sequence tagged 
site (STS) markers (44) to provide long- 
range guidance and chromosome separation. 
Given the relatively manageable number of 
scaffolds, we chose not to produce this tiling 
in a fully automated manner, but to compute 
an initial tiling with a good heuristic and then 
use human curators to resolve discrepancies
or missed join opportunities. To this end, we 
developed a graphical user interface that dis- 
played the graph of tiling overlaps and the 
evidence for each. A human curator could 
then explore the implication of mapped STS 
data, dot-plots of sequence overlap, and a 
visual display of the mate-pair evidence sup- 
porting a given choice. The result of this 
process was a collection of "components," 
where each component was a tiled set of 
BAC and Celera-unique scaffolds that had 
been curator-approved. The process resulted 
in 3845 components with an estimated span 
of 2.922 Gbp. 

In order to generate the final CSA, we 
assembled each component with the WGA 
algorithm. As was done in the WGA process, 
the bactig data were shredded into a synthetic 
2X shotgun data set in order to give the 
assembler the freedom to independently as- 
semble the data. By using faux reads rather 
than bactigs, the assembly algorithm could 
correct errors in the assembly of bactigs and 
remove chimeric content in a PFP data entry.
Chimeric or contaminating sequence (from
another part of the genome) would not be 
incorporated into the reassembly of the com- 
ponent because it did not belong there. In 
effect, the previous steps in the CSA process 
served only to bring together Celera frag- 
ments and PFP data relevant to a large con- 
tiguous segment of the genome, wherein we 
applied the assembler used for WGA to pro- 
duce an ab initio assembly of the region. 
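
Shredding a bactig into faux reads can be sketched as below. The text specifies 550-bp reads forming a 2X covering; the exact tiling is not given, so this sketch assumes a stride of half the read length plus one final read anchored at the end of the sequence. The function name and the toy sequence are hypothetical.

```python
def shred(seq, read_len=550, depth=2):
    """Cut `seq` into overlapping faux reads giving roughly `depth`-fold coverage."""
    if len(seq) <= read_len:
        return [seq]
    stride = read_len // depth
    starts = list(range(0, len(seq) - read_len + 1, stride))
    if starts[-1] != len(seq) - read_len:
        starts.append(len(seq) - read_len)  # make sure the tail is covered
    return [seq[s:s + read_len] for s in starts]


bactig = "ACGT" * 500  # toy 2,000-bp sequence
reads = shred(bactig)
print(len(reads), sum(len(r) for r in reads) / len(bactig))
```

Because the faux reads carry no record of how the bactig was originally assembled, the assembler is free to place them independently, which is the point made in the text.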

WGA assembly of the components result- 
ed in a set of scaffolds totaling 2.906 Gbp in 
span and consisting of 2.654 Gbp of se- 
quence. The chaff, or set of reads not incor- 
porated into the assembly, numbered 6.17 
million, or 22%. More than 90.0% of the 
genome was covered by scaffolds spanning 
>100 kbp long, and these averaged 92.2% 
sequence and 7.8% gaps with a total of 2.492 
Gbp of sequence. There were a total of 
105,264 gaps among the 107,199 contigs that 
belong to the 1940 scaffolds spanning >100 
kbp. The average scaffold size was 1.4 Mbp, 
the average contig size was 23.24 kbp, and 
the average gap size was 2.0 kbp where each 
distribution of sizes was exponential. As 
such, averages tend to be underrepresentative 
of the majority of the data. Figure 5 shows a 
histogram of the bases in scaffolds of various 
size ranges. Consider also that more than 
49% of all gaps were <500 bp long, more 
than 62% of all gaps were <1 kbp, and all 
gaps are <100 kbp long. Similarly, more than
73% of the sequence is in contigs >30 kbp,
more than 49% is in contigs >100 kbp, and
the largest contig was 1.99 Mbp long. Table 3
provides summary statistics for the structure 
of this assembly with a direct comparison to 
the WGA assembly. 
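
The averages quoted for the CSA scaffolds larger than 100 kbp can be rederived from the corresponding Table 3 column. The snippet below is a plain arithmetic check using the table values; the variable names are illustrative.

```python
# Table 3, compartmentalized shotgun assembly, scaffolds >100 kbp.
bp_in_scaffolds = 2_700_489_906
bp_in_contigs = 2_491_538_372
n_scaffolds = 1_935
n_contigs = 107_199
n_gaps = 105_264

# Each scaffold with k contigs has k - 1 internal gaps.
gaps_expected = n_contigs - n_scaffolds

avg_scaffold = bp_in_scaffolds / n_scaffolds
avg_contig = bp_in_contigs / n_contigs
avg_gap = (bp_in_scaffolds - bp_in_contigs) / n_gaps

print(gaps_expected, round(avg_scaffold), round(avg_contig), round(avg_gap))
```

These values agree with the roughly 1.4 Mbp, 23.24 kbp, and 2.0 kbp averages given in the text.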

2.5 Comparison of the WGA and CSA 
scaffolds 

Having obtained two assemblies of the hu- 
man genome via independent computational 
processes (WGA and CSA), we compared 
scaffolds from the two assemblies as another 
means of investigating their completeness, 
consistency, and contiguity. From each as- 
sembly, a set of reference scaffolds contain- 
ing at least 1000 fragments (Celera sequenc- 
ing reads or bactig shreds) was obtained; this 
amounted to 2218 WGA scaffolds and 1717 
CSA scaffolds, for a total of 2.087 Gbp and 
2.474 Gbp. The sequence of each reference
scaffold was compared to the sequence of all 
scaffolds from the other assembly with which 
it shared at least 20 fragments or at least 20% 
of the fragments of the smaller scaffold. For 
each such comparison, all matches of at least 
200 bp with at most 2% mismatch were 
tabulated.
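
The rule for choosing which scaffold pairs to compare (at least 20 shared fragments, or at least 20% of the fragments of the smaller scaffold) can be written as a small predicate. Representing each scaffold as a set of fragment identifiers is an assumption made for illustration, as is the function name.

```python
def should_compare(frags_a, frags_b, min_shared=20, min_fraction=0.20):
    """frags_a, frags_b: sets of fragment identifiers in two scaffolds."""
    shared = len(frags_a & frags_b)
    smaller = min(len(frags_a), len(frags_b))
    return shared >= min_shared or (smaller > 0 and shared >= min_fraction * smaller)


reference = set(range(1000))
print(should_compare(reference, set(range(990, 1050))))  # 10 shared of 60: below both cutoffs
print(should_compare(reference, set(range(985, 1045))))  # 15 shared of 60: 25% of the smaller
```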

From this tabulation, we estimated the 
amount of unique sequence in each assembly 
in two ways. The first was to determine the
number of bases of each assembly that were 
not covered by a matching segment in the
other assembly. Some 82.5 Mbp of the WGA 
(3.95%) was not covered by the CSA, where- 
as 204.5 Mbp (8.26%) of the CSA was not 
covered by the WGA. This estimate did not 
require any consistency of the assemblies or 
any uniqueness of the matching segments. 
Thus, another analysis was conducted in 
which matches of less than 1 kbp between a 
pair of scaffolds were excluded unless they 
were confirmed by other matches having a 
consistent order and orientation. This gives 
some measure of consistent coverage: 1.982 
Gbp (95.00%) of the WGA is covered by the 
CSA, and 2.169 Gbp (87.69%) of the CSA is 
covered by the WGA by this more stringent 
measure. 
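
The first coverage estimate amounts to taking the union of all matching segments on a scaffold and counting the bases left uncovered. A minimal sketch of that interval computation is shown below; the half-open interval convention, the function name, and the toy intervals are assumptions, and the final line simply reproduces the 3.95% figure from the quoted totals.

```python
def covered_bases(intervals):
    """Total bases covered by the union of half-open (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # extend the current merged interval
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Toy matches on a 2,000-bp scaffold: two overlapping segments and one separate one.
matches = [(0, 300), (200, 500), (1000, 1200)]
print(2000 - covered_bases(matches))  # uncovered bases

# 82.5 Mbp uncovered out of 2.087 Gbp of WGA reference scaffolds.
print(round(82.5 / 2087 * 100, 2))
```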

The comparison of WGA to CSA also 
permitted evaluation of scaffolds for structur- 
al inconsistencies. We looked for instances in 
which a large section of a scaffold from one 
assembly matched only one scaffold from the 
other assembly, but failed to match over the 
full length of the overlap implied by the 
matching segments. An initial set of candi- 
dates was identified automatically, and then 
each candidate was inspected by hand. From 
this process, we identified 31 instances in 
which the assemblies appear to disagree in a 
nonlocal fashion. These cases are being fur- 
ther evaluated to determine which assembly 
is in error and why. 

In addition, we evaluated local inconsis- 
tencies of order or orientation. The following 
results exclude cases in which one contig in
one assembly corresponds to more than one 
overlapping contig in the other assembly (as 
long as the order and orientation of the latter 
agrees with the positions they match in the 
former). Most of these small rearrangements 
involved segments on the order of hundreds 
of base pairs and rarely > 1 kbp. We found a 
total of 295 kbp (0.012%) in the CSA assemblies
that were locally inconsistent with the
WGA assemblies, whereas 2.108 Mbp
(0.11%) in the WGA assembly were inconsistent
with the CSA assembly.

The CSA assembly was a few percentage
points better in terms of coverage and slightly 
more consistent than the WGA, because it 
was in effect performing a few thousand shot- 
gun assemblies of megabase-sized problems, 
whereas the WGA is performing a shotgun 
assembly of a gigabase-sized problem. When 
one considers the increase of two-and-a-half 
orders of magnitude in problem size, the in- 
formation loss between the two is remarkably 
small. Because CSA was logistically easier to 
deliver and the better of the two results avail- 
able at the time when downstream analyses 
needed to be begun, all subsequent analysis 
was performed on this assembly. 

2.6 Mapping scaffolds to the genome 

The final step in assembling the genome was to 
order and orient the scaffolds on the chromo- 
somes. We first grouped scaffolds together on 
the basis of their order in the components from 
CSA. These grouped scaffolds were reordered 
by examining residual mate-pairing data be- 
tween the scaffolds. We next mapped the scaf- 
fold groups onto the chromosome using physi- 
cal mapping data. This step depends on having 
reliable high-resolution map information such 
that each scaffold will overlap multiple mark- 
ers. There are two genome-wide types of map 
information available: high-density STS maps 
and fingerprint maps of BAC clones developed 
at Washington University (45). Among the
genome-wide STS maps, GeneMap99 (GM99)
has the most markers and therefore was most 
useful for mapping scaffolds. The two different 
mapping approaches are complementary to one 
another. The fingerprint maps should have better
local order because they were built by comparison
of overlapping BAC clones. On the
other hand, GM99 should have a more reliable 
long-range order, because the framework mark- 
ers were derived from well-validated genetic 
maps. Both types of maps were used as a
reference for human curation of the compo- 
nents that were the input to the regional assem- 
bly, but they did not determine the order of 
sequences produced by the assembler.

Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of total
sequence is indicated.



In order to determine the effectiveness of 
the fingerprint maps and GM99 for mapping 
scaffolds, we first examined the reliability of 
these maps by comparison with large scaf- 
folds. Only 1% of the STS markers on the 10 
largest scaffolds (those >9 Mbp) were 
mapped on a different chromosome on 
GM99. Two percent of the STS markers dis- 
agreed in position by more than five frame- 
work bins. However, for the fingerprint 
maps, a 2% chromosome discrepancy was 
observed, and on average 23.8% of BAC 
locations in the scaffold sequence disagreed 
with fingerprint map placement by more than 
five BACs. When further examining the 
source of discrepancy, it was found that most 
of the discrepancy came from 4 of the 10 
scaffolds, indicating that there is variation in
the quality of either the map or the scaffolds.
All four scaffolds were assembled as well as
the other six, as judged by clone coverage
analysis, and showed the same low discrep- 
ancy rate to GM99, and thus we concluded 
that the fingerprint map global order in these 
cases was not reliable. Smaller scaffolds had 
a higher discordance rate with GM99 (4.21% 
of STSs were discordant by more than five 
framework bins), but a lower discordance rate 
with the fingerprint maps (11% of BACs 
disagreed with fingerprint maps by more than 
five BACs). This observation agrees with the 
clone coverage analysis (46) that Celera scaf- 
fold construction was better supported by 
long-range mate pairs in larger scaffolds than 
in small scaffolds. 

We created two orderings of Celera scaffolds
on the basis of the markers (BAC or
STS) on these maps. Where the order of 
scaffolds agreed between GM99 and the 
WashU BAC map, we had a high degree of 
confidence that that order was correct; these 
scaffolds were termed "anchor scaffolds." 
Only scaffolds with a low overall discrepancy 
rate with both maps were considered anchor 
scaffolds. Scaffolds in GM99 bins were al- 
lowed to permute in their order to match 
WashU ordering, provided they did not vio- 
late their framework orders. Orientation of 
individual scaffolds was determined by the 
presence of multiple mapped markers with 
consistent order. Scaffolds with only one 
marker have insufficient information to as- 
sign orientation. We found 70.1% of the ge- 
nome in anchored scaffolds, more than 99% 
of which are also oriented (Table 4). Because 
GM99 is of lower resolution than the WashU 
map, a number of scaffolds without STS 
matches could be ordered relative to the
anchored scaffolds because they included
sequence from the same or adjacent BACs on
the WashU map. On the other hand, because 
of occasional WashU global ordering dis- 
crepancies, a number of scaffolds determined 
to be "unmappable" on the WashU map could 
be ordered relative to the anchored scaffolds
with GM99. These scaffolds were termed
"ordered scaffolds." We found that 13.9% of 
the assembly could be ordered by these ad- 
ditional methods, and thus 84.0% of the ge- 
nome was ordered unambiguously. 

Next, all scaffolds that could be placed, 
but not ordered, between anchors were as- 
signed to the interval between the anchored 
scaffolds and were deemed to be "bounded"
between them. For example, small scaffolds
having STS hits from the same GeneMap
bin or hitting the same BAC cannot be
ordered relative to each other, but can be
assigned a placement boundary relative to
other anchored or ordered scaffolds. The
remaining scaffolds either had no localization
information, conflicting information,
or could only be assigned to a generic 
chromosome location. Using the above approaches,
~98% of the genome was anchored,
ordered, or bounded.

Finally, we assigned a location for each 
scaffold placed on the chromosome by 
spreading out the scaffolds per chromosome. 
We assumed that the remaining unmapped 
scaffolds, constituting 2% of the genome, 
were distributed evenly across the genome. 
By dividing the sum of unmapped scaffold
lengths by the number of
mapped scaffolds, we arrived at an estimate
of interscaffold gap of 1483 bp. This gap was 
used to separate all the scaffolds on each 
chromosome and to assign an offset in the 
chromosome. 
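
The gap estimate and offset assignment described above can be sketched as follows. This is an illustration under stated assumptions rather than the code actually used: the scaffold lengths are invented (chosen so that the toy data happen to reproduce the 1483-bp figure), and the function names are hypothetical.

```python
def interscaffold_gap(unmapped_lengths, n_mapped_scaffolds):
    """Spread the total unmapped sequence evenly over the mapped scaffolds."""
    return round(sum(unmapped_lengths) / n_mapped_scaffolds)


def assign_offsets(scaffold_lengths, gap):
    """Place ordered scaffolds end to end, separated by a fixed gap."""
    offsets = []
    position = 0
    for length in scaffold_lengths:
        offsets.append(position)
        position += length + gap
    return offsets


# Invented toy data: three unmapped scaffolds and two mapped scaffolds.
gap = interscaffold_gap([1200, 800, 966], n_mapped_scaffolds=2)

# Invented lengths (bp) of three ordered scaffolds on one chromosome.
offsets = assign_offsets([5000, 3000, 4000], gap)
print(gap, offsets)
```

Each offset is the chromosome coordinate at which a scaffold starts once the preceding scaffolds and their separating gaps have been laid down.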

During the scaffold-mapping effort, we en- 
countered many problems that resulted in addi- 
tional quality assessment and validation analy- 
sis. At least 978 (3% of 33,173) BACs were 
believed to have sequence data from more than 
one location in the genome (47). This is con- 
sistent with the bactig chimerism analysis re- 
ported above in the Assembly Strategies sec- 
tion. These BACs could not be assigned to 
unique positions within the CSA assembly and 
thus could not be used for ordering scaffolds. 
Likewise, it was not always possible to assign 
STSs to unique locations in the assembly be- 
cause of genome duplications, repetitive ele- 
ments, and pseudogenes. 

Because of the time required for an ex- 
haustive search for a perfect overlap, CSA 
generated 21,607 intrascaffold gaps where 
the mate-pair data suggested that the contigs 
should overlap, but no overlap was found. 
These gaps were defined as a fixed 50 bp in 
length and make up 18.6% of the total 
116,442 gaps in the CSA assembly. 

We chose not to use the order of exons 
implied in cDNA or EST data as a way of 
ordering scaffolds. The rationale for not us- 
ing this data was that doing so would have 
biased certain regions of the assembly by 
rearranging scaffolds to fit the transcript data 
and made validation of both the assembly and 
gene definition processes more difficult. 



2.7 Assembly and validation analysis 

We analyzed the assembly of the genome 
from the perspectives of completeness 
(amount of coverage of the genome) and 
correctness (the structural accuracy of the 
order and orientation and the consensus se- 
quence of the assembly). 

Completeness. Completeness is defined as 
the percentage of the euchromatic sequence 
represented in the assembly. This cannot be 
known with absolute certainty until the eu- 
chromatin sequence has been completed. 
However, it is possible to estimate complete- 
ness on the basis of (i) the estimated sizes of 
intrascaffold gaps; (ii) coverage of the two 
published chromosomes, 21 and 22 (48, 49); 
and (iii) analysis of the percentage of an 
independent set of random sequences (STS 
markers) contained in the assembly. The 
whole-genome libraries contain heterochro- 
matic sequence and, although no attempt has 
been made to assemble it, there may be in- 
stances of unique sequence embedded in re- 
gions of heterochromatin as were observed in 
Drosophila (50, 51). 

The sequences of human chromosomes 21 
and 22 have been completed to high quality 
and published (48, 49). Although this se- 
quence served as input to the assembler, the 
finished sequence was shredded into a shot- 
gun data set so that the assembler had the 
opportunity to assemble it differently from 
the original sequence in the case of structural 
polymorphisms or assembly errors in the 
BAC data. In particular, the assembler must 
be able to resolve repetitive elements at the 
scale of components (generally multimega- 
base in size), and so this comparison reveals 
the level to which the assembler resolves 
repeats. In certain areas, the assembly struc- 
ture differs from the published versions of 
chromosomes 21 and 22 (see below). The 
consequence of the flexibility to assemble 
"finished" sequence differently on the basis 
of Celera data resulted in an assembly with 
more segments than the chromosome 21 and 
22 sequences. We examined the reasons why 
there are more gaps in the Celera sequence 
than in chromosomes 21 and 22 and expect 
that they may be typical of gaps in other 
regions of the genome. In the Celera assem- 
bly, there are 25 scaffolds, each containing at 
least 10 kb of sequence, that collectively span 
94.3% of chromosome 21. Sixty-two scaf- 
folds span 95.7% of chromosome 22. The 
total length of the gaps remaining in the 
Celera assembly for these two chromosomes 
is 3.4 Mbp. These gap sequences were analyzed
by RepeatMasker and by searching
against the entire genome assembly (52). 
About 50% of the gap sequence consisted of 
common repetitive elements identified by Re- 
peatMasker; more than half of the remainder 
was lower copy number repeat elements.

A more global way of assessing completeness
is to measure the content of an independent
set of sequence data in the assembly. We com- 
pared 48,938 STS markers from Genemap99 
(51) to the scaffolds. Because these markers 
were not used in the assembly processes, they 
provided a truly independent measure of com- 
pleteness. ePCR (53) and BLAST (54) were 
used to locate STSs on the assembled genome. 
We found 44,524 (91%) of the STSs in the 
mapped genome. An additional 2648 markers 
(5.4%) were found by searching the unas- 
sembled data or "chaff." We identified 1283 
STS markers (2.6%) not found in either Celera 
sequence or BAC data as of September 2000, 
raising the possibility that these markers may 
not be of human origin. If that were the case, 
the Celera assembled sequence would represent 
93.4% of the human genome and the unas- 
sembled data 5.5%, for a total of 98.9% cover- 
age. Similarly, we compared CSA against 
36,678 TNG radiation hybrid markers (55a) 
using the same method. We found that 32,371 
markers (88%) were located in the mapped 
CSA scaffolds, with 2055 markers (5.6%) 
found in the remainder. This gave a 94% cov- 
erage of the genome through another genome- 
wide survey. 
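
The marker-based completeness figures above are simple proportions of the counts given in the text. The sketch below recomputes them; the function names are hypothetical, and the recomputed values agree with the reported percentages only after rounding (to within a few tenths of a percentage point), so the output is not annotated here.

```python
def percent(part, whole):
    return 100.0 * part / whole


def marker_coverage(total, in_scaffolds, in_remainder, excluded=0):
    """Return (% in mapped scaffolds, % in remainder, % combined)."""
    usable = total - excluded
    mapped = percent(in_scaffolds, usable)
    remainder = percent(in_remainder, usable)
    return mapped, remainder, mapped + remainder


# GeneMap99 STS markers, all 48,938 counted.
gm99_all = marker_coverage(48_938, 44_524, 2_648)
# Same markers, excluding the 1283 not found in any Celera or BAC data.
gm99_human = marker_coverage(48_938, 44_524, 2_648, excluded=1_283)
# TNG radiation hybrid markers.
tng = marker_coverage(36_678, 32_371, 2_055)

for label, result in [("GM99", gm99_all), ("GM99 excl.", gm99_human), ("TNG", tng)]:
    print(label, " ".join(f"{value:.2f}" for value in result))
```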

Table 4. Summary of scaffold mapping. Scaffolds
were mapped to the genome with different levels
of confidence (anchored scaffolds have the highest
confidence; unmapped scaffolds have the lowest).
Anchored scaffolds were consistently ordered by
the WashU BAC map and GM99. Ordered scaffolds
were consistently ordered by at least one of
the following: the WashU BAC map, GM99, or
component tiling path. Bounded scaffolds had order
conflicts between at least two of the external
maps, but their placements were adjacent to a
neighboring anchored or ordered scaffold. Unmapped
scaffolds had, at most, a chromosome
assignment. The scaffold subcategories are given
below each category.

Correctness. Correctness is defined as the
structural and sequence accuracy of the assembly.
Because the source sequences for the
Celera data and the GenBank data are from
different individuals, we could not directly
compare the consensus sequence of the assembly
[...] for those that were correct (Table 5). The standard
deviations for all Celera libraries were
quite small, less than 15% of the insert
length, with the exception of a few 50-kbp
libraries. The 2- and 10-kbp libraries contained
less than 2% invalid mate pairs, whereas
the 50-kbp libraries were somewhat higher
(~10%). Thus, although the mate-pair information
was not perfect, its accuracy was such
that measuring valid, misoriented, and misseparated
pairs with respect to a given assembly
was deemed to be a reliable instrument
for validation purposes, especially when several
mate pairs confirm or deny an ordering.

The clone coverage of the genome was
39X, meaning that any given base pair was,
on average, contained in 39 clones or, equivalently,
spanned by 39 mate-paired reads.
Areas of low clone coverage or areas with a
high proportion of invalid mate pairs would
indicate potential assembly problems. We
computed the coverage of each base in the
assembly by valid mate pairs (Table 6). In
summary, for scaffolds >30 kbp in length,
less than 1% of the Celera assembly was in
regions of less than 3X clone coverage. Thus,
more than 99% of the assembly, including
order and orientation, is strongly supported
by this measure alone.

We examined the locations and number of
all misoriented and misseparated mates. In
addition to doing this analysis on the CSA
assembly (as of 1 October 2000), we also
performed a study of the PFP assembly as of
5 September 2000 (30, 55b). In this latter
case, Celera mate pairs had to be mapped to 
the PFP assembly. To avoid mapping errors 
due to high-fidelity repeats, the only pairs 
mapped were those for which both reads 
matched at only one location with less than 
6% differences. A threshold was set such that 
sets of five or more simultaneously invalid 
mate pairs indicated a potential breakpoint, 
where the construction of the two assemblies 
differed. The graphic comparison of the CSA 
chromosome 21 assembly with the published 
sequence (Fig. 6A) serves as a validation of 
this methodology. Blue tick marks in the 
panels indicate breakpoints. There were a 
similar (small) number of breakpoints on 
both chromosome sequences. The exception 
was 12 sets of scaffolds in the Celera assem- 
bly (a total of 3% of the chromosome length 
in 212 single-contig scaffolds) that were 
mapped to the wrong positions because they 
were too small to be mapped reliably. Figures
6 and 7 and Table 6 illustrate the mate-pair
differences and breakpoints between the two 
assemblies. There was a higher percentage of 
misoriented and misseparated mate pairs in 
the large-insert libraries (50 kbp and BAC 
ends) than in the small-insert libraries in both 
assemblies (Table 6). The large-insert librar- 
ies are more likely to identify discrepancies 
simply because they span a larger segment of 
the genome. The graphic comparison be- 
tween the two assemblies for chromosome 8 
(Fig. 6, B and C) shows that there are many [...]

[...] fragment sequences were mapped to [...] chromosome 21. Each mate pair uniquely [...]
orientation and placement (number of mate pairs tested). If the two mates had incorrect relative
orientation or placement, they were considered invalid (number of invalid mate pairs).

[...] alternative splicing and alternative transcription
initiation and termination sites. Our cells are
able to discern within the billions of base
pairs of the genomic DNA the signals for
initiating transcription and for splicing together
exons separated by a few or hundreds
of thousands of base pairs. The first step in
characterizing the genome is to define the
structure of each gene and each transcription
unit.

The number of protein-coding genes in
mammals has been controversial from the
outset. Initial estimates based on reassociation
data placed it between 30,000 and 40,000,
whereas later estimates from the brain were
>100,000 (55). More recent data from both
the corporate and public sectors, based on
EST, CpG island, and
transcript density-based extrapolations, have
not reduced this variance. The highest recent 
number of 142,634 genes emanates from a 
report from Incyte Pharmaceuticals, and is 
based on a combination of EST data and the 
association of ESTs with CpG islands (57). 
In stark contrast are three quite different and
much lower estimates: one of ~35,000 genes
derived with genome-wide EST data and
sampling procedures in conjunction with 
chromosome 22 data (58); another of 28,000 
to 34,000 genes derived with a comparative 
methodology involving sequence conserva- 
tion between humans and the puffer fish Te- 
traodon nigroviridis (59); and a figure of 
35,000 genes, which was derived simply by 
extrapolating from the density of 770 known 
and predicted genes in the 67 Mbp of chro- 
mosomes 21 and 22, to the approximately 
3-Gbp euchromatic genome. 
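
The last of these estimates is a straight density extrapolation, which can be reproduced from the numbers quoted above. A minimal sketch (the function name is hypothetical; the 3-Gbp genome size is the approximate figure given in the text):

```python
def extrapolate_gene_count(genes_in_sample, sample_mbp, genome_mbp):
    """Scale an observed gene density up to the whole genome."""
    return genes_in_sample / sample_mbp * genome_mbp


# 770 known and predicted genes in the 67 Mbp of chromosomes 21 and 22,
# extrapolated to an approximately 3000-Mbp euchromatic genome.
estimate = extrapolate_gene_count(770, 67, 3000)
print(round(estimate))  # 34478, i.e. roughly 35,000
```

The result depends linearly on the assumed genome size and on how representative the gene density of chromosomes 21 and 22 is of the rest of the genome.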

The problem of computational identifica- 
tion of transcriptional units in genomic DNA 
sequence can be divided into two phases. The 
first is to partition the sequence into segments 
that are likely to correspond to individual 
genes. This is not trivial and is a weakness of 
most de novo gene-finding algorithms. It is 
also critical to determining the number of 
genes in the human gene inventory. The sec- 
ond challenge is to construct a gene model 
that reflects the probable structure of the 
transcript(s) encoded in the region. This can
be done with reasonable accuracy when a
full-length cDNA has been sequenced or a
highly homologous protein sequence is
known. De novo gene prediction, although
less accurate, is the only way to find genes
that are not represented by homologous proteins
or ESTs. The following section describes
the methods we have developed to
address these problems for the prediction of
protein-coding genes.

[Table title, truncated: "...is of compartmentalized shotgun (CSA) and PFP assemblies"]

We have developed a rule-based expert sys- 
tem, called Otto, to identify and characterize 
genes in the human genome (60). Otto attempts 
to simulate in software the process that a human 
annotator uses to identify a gene and refine its 
structure. In the process of annotating a region 
of the genome, a human curator examines the 
evidence provided by the computational pipe- 
line (described below) and examines how var- 
ious types of evidence relate to one another. A 
curator puts different levels of confidence in 
different types of evidence and looks for 
certain patterns of evidence to support gene 
annotation. For example, a curator may ex- 
amine homology to a number of ESTs and 
evaluate whether or not they can be connect- 
ed into a longer, virtual mRNA. The curator 
would also evaluate the strength of the simi- 
larity and the contiguity of the match, in 
essence asking whether any ESTs cross 
splice-junctions and whether the edges of 
putative exons have consensus splice sites. 
This kind of manual annotation process was
used to annotate the Drosophila genome. 

The Otto system can promote observed 
evidence to a gene annotation in one of two 
ways. First, if the evidence includes a high- 
quality match to the sequence of a known 
gene [here defined as a human gene repre- 
sented in a curated subset of the RefSeq 
database (61)], then Otto can promote this to 
a gene annotation. In the second method, Otto 
evaluates a broad spectrum of evidence and 
determines if this evidence is adequate to 
support promotion to a gene annotation. 
These processes are described below. 

Initially, gene boundaries are predicted on 
the basis of examination of sets of overlap- 
ping protein and EST matches generated by a 
computational pipeline (62). This pipeline 
searches the scaffold sequences against pro- 
tein, EST, and genome-sequence databases to 
define regions of sequence similarity and 
runs three de novo gene-prediction programs. 

To identify likely gene boundaries, re- 
gions of the genome were partitioned by Otto 
on the basis of sequence matches identified 
by BLAST. Each of the database sequences 
matched in the region under analysis was 
compared by an algorithm that takes into 
account both coordinates of the matching se- 
quence, as well as the sequence type (e.g., 
protein, EST, and so forth). The results were 
used to group the matches into bins of related 
sequences that may define a gene and identify
gene boundaries. During this process, multiple
hits to the same region were collapsed to a 
coherent set of data by tracking the coverage of 
a region. For example, if a group of bases was 
represented by multiple overlapping ESTs, the 
union of these regions matched by the set of 
ESTs on the scaffold was marked as being 
supported by EST evidence. This resulted in a 
series of "gene bins," each of which was be- 
lieved to contain a single gene. One weakness of 
this initial implementation of the algorithm was 
in predicting gene boundaries in regions of tan- 
demly duplicated genes. Gene clusters frequent- 
ly resulted in homologous neighboring genes
being joined together, resulting in an annotation
that artificially concatenated these gene models. 
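
The grouping of matches into gene bins can be illustrated with a deliberately simplified sketch. It assumes that matches are (start, end, evidence type) tuples on one scaffold and that any overlap places two matches in the same bin; the comparison algorithm described above also weighs the sequence type, which is not modeled here. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GeneBin:
    start: int
    end: int
    evidence: set = field(default_factory=set)


def build_gene_bins(matches):
    """Group overlapping (start, end, evidence_type) matches into bins."""
    bins = []
    for start, end, kind in sorted(matches):
        if bins and start <= bins[-1].end:
            # Overlaps the current bin: extend its coverage and record the type.
            bins[-1].end = max(bins[-1].end, end)
            bins[-1].evidence.add(kind)
        else:
            bins.append(GeneBin(start, end, {kind}))
    return bins


# Invented matches: two overlapping groups separated by unmatched sequence.
matches = [
    (100, 400, "EST"),
    (350, 900, "protein"),
    (5000, 5600, "EST"),
    (5500, 6100, "EST"),
]
for gene_bin in build_gene_bins(matches):
    print(gene_bin.start, gene_bin.end, sorted(gene_bin.evidence))
```

A purely overlap-based rule like this one also shows the weakness noted above: matches from adjacent, tandemly duplicated genes that overlap on the scaffold would be merged into a single bin.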

Next, known genes (those with exact match- 
es of a full-length cDNA sequence to the ge- 
nome) were identified, and the region corre- 
sponding to the cDNA was annotated as a 
predicted transcript. A subset of the curated
human gene set RefSeq from the National
Center for Biotechnology Information
(NCBI) was included as a data set searched in 
the computational pipeline. If a RefSeq tran- 
script matched the genome assembly for at least 
50% of its length at >92% identity, then the 
SIM4 (63) alignment of the RefSeq transcript to
the region of the genome under analysis was
promoted to the status of an Otto annotation. 
Because the genome sequence has gaps and 
sequence errors such as frameshifts, it was not 
always possible to predict a transcript that 
agrees precisely with the experimentally deter- 
mined cDNA sequence. A total of 6538 genes 
in our inventory were identified and transcripts 
predicted in this way. 
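
The promotion criterion for known genes reduces to a two-part threshold test. A minimal sketch, assuming that the aligned length and fractional identity have already been computed from the alignment (the function and parameter names are hypothetical):

```python
MIN_LENGTH_FRACTION = 0.50  # at least 50% of the transcript length
MIN_IDENTITY = 0.92         # strictly greater than 92% identity


def promote_refseq(aligned_bases, transcript_length, identity):
    """Return True if a RefSeq match meets both thresholds stated in the text."""
    length_fraction = aligned_bases / transcript_length
    return length_fraction >= MIN_LENGTH_FRACTION and identity > MIN_IDENTITY


print(promote_refseq(600, 1000, 0.95))  # True: 60% of length at 95% identity
print(promote_refseq(400, 1000, 0.99))  # False: only 40% of length matched
print(promote_refseq(600, 1000, 0.92))  # False: identity not above 92%
```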

Fig. 6. Comparison of the CSA and the PFP assembly.
(A) All of chromosome 21, (B) all of chromosome 8,
and (C) a 1-Mb region of chromosome 8 representing
a single Celera scaffold. To generate the figure, Celera
fragment sequences were mapped onto each assembly.
The PFP assembly is indicated in the upper third
of each panel; the Celera assembly is indicated in the
lower third. In the center of the panel, green lines
show Celera sequences that are in the same order and
orientation in both assemblies and form the longest
consistently ordered run of sequences. Yellow lines
indicate sequence blocks that are in the same orientation,
but out of order. Red lines indicate sequence
blocks that are not in the same orientation. For
clarity, in the latter two cases, lines are only drawn
between segments of matching sequence that are at
least 50 kbp long. The top and bottom thirds of each
panel show the extent of Celera mate-pair violations
(red, misoriented; yellow, incorrect distance between
the mates) for each assembly grouped by library size.
(Mate pairs that are within the correct distance, as
expected from the mean library insert size, are omitted
from the figure for clarity.) Predicted breakpoints,
corresponding to stacks of violated mate pairs of the
same type, are shown as blue ticks on each assembly
axis. Runs of more than 10,000 Ns are shown as cyan
bars. Plots of all 24 chromosomes can be seen in Web
fig. 3 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1.

Fig. 7. Schematic view of the distribution of breakpoints and large gaps
on all chromosomes. For each chromosome, the upper pair of lines
represent the PFP assembly, and the lower pair of lines represent Celera's
assembly. Blue tick marks represent breakpoints, whereas red tick marks
represent a gap of larger than 10,000 bp. The number of breakpoints per
chromosome is indicated in black, and the chromosome numbers in red.

Regions that have a substantial amount of
sequence similarity, but do not match known
genes, were analyzed by that part of the Otto
system that uses the sequence similarity
information to predict a transcript. Here, Otto
evaluates evidence generated by the computational
pipeline, corresponding to conservation
between mouse and human genomic
DNA, similarity to human transcripts (ESTs
and cDNAs), similarity to rodent transcripts
(ESTs and cDNAs), and similarity of the
translation of human genomic DNA to known
proteins to predict potential genes in the human
genome. The sequence from the region
of genomic DNA contained in a gene bin was
extracted, and the subsequences supported by
any homology evidence were marked (plus 100
bases flanking these regions). The other bases
in the region, those not covered by any homology
evidence, were replaced by N's. This
sequence segment, with high confidence regions
represented by the consensus genomic
sequence and the remainder represented by N's,
was then evaluated by Genscan to see if a
consistent gene model could be generated. This
procedure simplified the gene-prediction task
by first establishing the boundary for the gene
(not a strength of most gene-finding algorithms),
and by eliminating regions with no
supporting evidence. If Genscan returned a
plausible gene model, it was further evaluated
before being promoted to an "Otto" annotation.
The final Genscan predictions were often quite
different from the prediction that Genscan
returned on the same region of native genomic
sequence. A weakness of using Genscan to
refine the gene model is the loss of valid, small
exons from the final annotation.
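
The masking step that prepares a gene bin for Genscan (keep bases supported by homology evidence plus 100 flanking bases, replace everything else with N) can be sketched as below. This is an illustration only: the supported regions are assumed to be half-open (start, end) coordinates within the extracted sequence, the function name is hypothetical, and the toy example uses a 2-base flank so that the effect is visible on a short string.

```python
def mask_unsupported(sequence, supported, flank=100):
    """Replace bases outside supported regions (plus flanks) with 'N'."""
    keep = [False] * len(sequence)
    for start, end in supported:
        # Extend each supported region by the flank, clipped to the sequence.
        lo = max(0, start - flank)
        hi = min(len(sequence), end + flank)
        for i in range(lo, hi):
            keep[i] = True
    return "".join(base if kept else "N" for base, kept in zip(sequence, keep))


# Toy 20-base sequence with one supported region at positions 8-11.
masked = mask_unsupported("ACGTACGTACGTACGTACGT", [(8, 12)], flank=2)
print(masked)  # NNNNNNGTACGTACNNNNNN
```

The masked string is what would be handed to the gene finder, so predictions can only be built from sequence that has at least some supporting evidence nearby.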

The next step in defining gene structures 
based on sequence similarity was to compare 
each predicted transcript with the homology- 
based evidence that was used in previous steps 
to evaluate the depth of evidence for each exon 
in the prediction. Internal exons were considered
to be supported if they were covered by
homology evidence to within ±10 bases of 
their edges. For first and last exons, the internal 
edge was required to be within 10 bases, but the 
external edge was allowed greater latitude to
allow for 5' and 3' untranslated regions 
(UTRs). To be retained, a prediction for a 
multi-exon gene must have evidence such that 
the total number of "hits," as defined above,
divided by the number of exons in the predic- 
tion must be >0.66 or must correspond to a 
RefSeq sequence. A single-exon gene must be 
covered by at least three supporting hits (±10 
bases on each side), and these must cover the 
complete predicted open reading frame. For 
a single-exon gene, we also required that 
the Genscan prediction include both a start 
and a stop codon. Gene models that did not 
meet these criteria were disregarded, and
those that passed were promoted to Otto
predictions. Homology-based Otto predictions
do not contain 3' and 5' untranslated
sequence. Although three de novo gene-finding
programs [GRAIL, Genscan, and FgenesH
(63)] were run as part of the computational
analysis, the results of these programs were not
directly used in making the Otto predictions.
Otto predicted 11,226 additional genes by
means of sequence similarity.

Table 7. Sensitivity and specificity of Otto and
Genscan. Sensitivity and specificity were calculated
by first aligning the prediction to the published
RefSeq transcript, tallying the number (N) of
uniquely aligned RefSeq bases. Sensitivity is the
ratio of N to the length of the published RefSeq
transcript. Specificity is the ratio of N to the
length of the prediction. All differences are significant
(Tukey HSD; P < 0.001).

Method                 Sensitivity    Specificity
Otto (RefSeq only)*    0.939          0.973
Otto (homology)†       0.604          0.884
Genscan                0.501          0.633

*Refers to those annotations produced by Otto using only
the Sim4-polished RefSeq alignment rather than an evidence-based
Genscan prediction. †Refers to those
annotations produced by supplying all available evidence
to Genscan.
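
The retention criteria for homology-based predictions (exon edges supported to within ±10 bases, a hits-to-exons ratio above 0.66 or a RefSeq match for multi-exon genes, and at least three hits plus start and stop codons for single-exon genes) can be sketched as follows. Several simplifications are assumptions of this sketch and not statements about Otto: a "hit" for a multi-exon gene is counted as one supported exon, the extra latitude allowed at the outer edges of first and last exons is ignored, and coverage of the complete open reading frame is approximated by the same edge tolerance. All names are hypothetical.

```python
TOLERANCE = 10  # bases, from the +/-10-base criterion in the text


def supporting_hits(exon, evidence, tolerance=TOLERANCE):
    """Count evidence segments whose edges lie within `tolerance` of the exon edges."""
    start, end = exon
    return sum(
        1
        for ev_start, ev_end in evidence
        if abs(ev_start - start) <= tolerance and abs(ev_end - end) <= tolerance
    )


def retain_prediction(exons, evidence, is_refseq=False, has_start=True, has_stop=True):
    """Apply the simplified retention rules to one predicted transcript."""
    if len(exons) > 1:
        supported = sum(1 for exon in exons if supporting_hits(exon, evidence) > 0)
        return is_refseq or supported / len(exons) > 0.66
    # Single-exon gene: at least three hits plus start and stop codons.
    return supporting_hits(exons[0], evidence) >= 3 and has_start and has_stop


three_exons = [(100, 200), (300, 400), (500, 600)]
print(retain_prediction(three_exons, [(98, 203), (305, 395)]))  # True: 2 of 3 exons
print(retain_prediction(three_exons, [(98, 203)]))              # False: 1 of 3 exons
print(retain_prediction([(100, 400)], [(100, 400), (95, 405), (108, 392)]))  # True
```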

3.2 Otto validation 

To validate the Otto homology-based process 
and the method that Otto uses to define the 
structures of known genes, we compared tran- 
scripts predicted by Otto with their correspond- 
ing (and presumably correct) transcript from a 
set of 4512 RefSeq transcripts for which there 
was a unique SIM4 alignment (Table 7). In
order to evaluate the relative performance of 
Otto and Genscan, we made three comparisons. 
The first involved a determination of the accuracy
of gene models predicted by Otto with
only homology data other than the correspond- 
ing RefSeq sequence (Otto homology in Table 
7). We measured the sensitivity (correctly pre- 
dicted bases divided by the total length of the 
cDNA) and specificity (correctly predicted 
bases divided by the sum of the correctly and 
incorrectly predicted bases). Second, we exam- 
ined the sensitivity and specificity of the Otto 
predictions that were made solely with the Ref- 
Seq sequence, which is the process that Otto 
uses to annotate known genes (Otto-RefSeq). 
And third, we determined the accuracy of the 
Genscan predictions corresponding to these 
RefSeq sequences. As expected, the alignment 
method (Otto-RefSeq) was the most accurate, 
and Otto-homology performed better than Genscan
by both criteria. Thus, 6.1% of true RefSeq
nucleotides were not represented in the Otto-RefSeq
annotations and 2.7% of the nucleotides
in the Otto-RefSeq transcripts were not contained
in the original RefSeq transcripts. The
discrepancies could come from legitimate 
differences between the Celera assembly 
and the RefSeq transcript due to polymor- 
phisms, incomplete or incorrect data in the 
Celera assembly, errors introduced by Sim4 
during the alignment process, or the pres- 
ence of alternatively spliced forms in the 
data set used for the comparisons. 
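
The base-level sensitivity and specificity defined above can be illustrated with a minimal Python sketch. It assumes each transcript is represented as a list of inclusive (start, end) exon coordinates on the assembly; the function names and the toy coordinates are hypothetical and are not taken from the Otto pipeline.

```python
def exon_bases(exons):
    """Expand inclusive (start, end) exons into a set of genomic positions."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def sensitivity_specificity(predicted_bases, refseq_bases):
    """Return (sensitivity, specificity) at the base level.

    N is the number of bases shared by the prediction and the RefSeq
    transcript. Sensitivity is N / RefSeq length; specificity is
    N / prediction length.
    """
    if not predicted_bases or not refseq_bases:
        raise ValueError("both transcripts must contain at least one base")
    n = len(predicted_bases & refseq_bases)
    return n / len(refseq_bases), n / len(predicted_bases)


# Toy example (made-up coordinates): 200 bases each, 150 bases shared.
refseq = exon_bases([(100, 199), (300, 399)])
prediction = exon_bases([(100, 199), (350, 449)])
sens, spec = sensitivity_specificity(prediction, refseq)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

In this toy case both measures are 0.75 because the prediction misses 50 RefSeq bases and adds 50 bases that are not in the RefSeq transcript.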

Because Otto uses an evidence-based ap- 
proach to reconstruct genes, the absence of 
experimental evidence for intervening exons 
may inadvertently result in a set of exons that
cannot be spliced together to give rise to a
transcript. In such cases, Otto may "split genes" 
when in fact all the evidence should be com- 
bined into a single transcript. We also examined 
the tendency of these methods to incorrectly 
split gene predictions. These trends are shown 
in Fig. 8. Both RefSeq and homology-based
predictions by Otto split known genes into few- 
er segments than Genscan alone. 



3.3 Gene number 

Recognizing that the Otto system is quite 
conservative, we used a different gene-pre- 
diction strategy in regions where the ho- 
mology evidence was less strong. Here the 
results of de novo gene predictions were 
used. For these genes, we insisted that a 
predicted transcript have at least two of the 
following types of evidence to be included 
in the gene set for further analysis: protein, 
human EST, rodent EST, or mouse genome 
fragment matches. This final class of pre- 
dicted genes is a subset of the predictions 
made by the three gene-finding programs 
that were used in the computational pipe- 
line. For these, there was not sufficient 
sequence similarity information for Otto to 
attempt to predict a gene structure. The 
three de novo gene-finding programs re- 
sulted in about 155,695 predictions, of 
which ~76,410 were nonredundant (nonoverlapping
with one another). Of these,
57,935 did not overlap known genes or 
predictions made by Otto. Only 21,350 of 
the gene predictions that did not overlap
Otto predictions were partially supported 
by at least one type of sequence similarity 
evidence, and 8619 were partially supported
by two types of evidence (Table 8).

The sum of this number (21,350) and the 
number of Otto annotations (17,764), 39,114,
is near the upper limit for the human gene 
complement. As seen in Table 8, if the re- 
quirement for other supporting evidence is 
made more stringent, this number drops rap- 
idly so that demanding two types of evidence 
reduces the total gene number to 26,383 and 
demanding three types reduces it to ~23,000.
Requiring that a prediction be supported by 
all four categories of evidence is too stringent 
because it would eliminate genes that encode
novel proteins (members of currently unde- 
scribed protein families). No correction for 
pseudogenes has been made at this point in 
the analysis. 

In a further attempt to identify genes that 
were not found by the autoannotation process 
or any of the de novo gene finders, we ex- 
amined regions outside of gene predictions 
that were similar to the EST sequence, and 
where the EST matched the genomic se- 
quence across a splice junction. After correct- 
ing for potential 3' UTRs of predicted genes, 
about 2500 such regions remained. Addition 
of a requirement for at least one of the fol- 
lowing evidence types — homology to mouse 
genomic sequence fragments, rodent ESTs, 
or cDNAs — or similarity to a known protein 
reduced this number to 1010. Adding this to 
the numbers from the previous paragraph 
would give us estimates of about 40,000, 
27,000, and 24,000 potential genes in the 
human genome, depending on the stringency 
of evidence considered. Table 8 illustrates the
number of genes and presents the degree of
confidence based on the supporting evidence.
Transcripts encoded by a set of 26,383 genes 
were assembled for further analysis. This set 
includes the 6538 genes predicted by Otto on 
the basis of matches to known genes, 11,226
transcripts predicted by Otto based on homol- 
ogy evidence, and 8619 from the subset of 
transcripts from de novo gene-prediction pro- 
grams that have two types of supporting ev- 
idence. The 26,383 genes are illustrated along 
chromosome diagrams in Fig. 1. These are a
very preliminary set of annotations and are 
subject to all the limitations of an automated 
process. Considerable refinement is still nec- 
essary to improve the accuracy of these tran- 
script predictions. All the predictions and 
descriptions of genes and the associated evi- 
dence that we present are the product of 
completely computational processes, not ex- 
pert curation. We have attempted to enumer- 
ate the genes in the human genome in such a 
way that we have different levels of confi- 
dence based on the amount of supporting 
evidence: known genes, genes with good pro- 
tein or EST homology evidence, and de novo 
gene predictions confirmed by modest ho- 
mology evidence. 
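
The arithmetic behind the gene-number estimates in this section can be checked with a short Python sketch. All counts are the ones quoted in the text and in Table 8; the variable names are illustrative only.

```python
# Counts quoted in the text and in Table 8.
otto_known = 6538        # Otto genes based on matches to known genes
otto_homology = 11226    # Otto genes predicted from homology evidence
# De novo predictions not overlapping Otto, by minimum number of evidence types.
de_novo_by_min_evidence = {1: 21350, 2: 8619, 3: 4947}
est_only_regions = 1010  # spliced-EST regions outside all gene predictions

otto_total = otto_known + otto_homology

estimates = {}
for min_types, de_novo in de_novo_by_min_evidence.items():
    annotated = otto_total + de_novo
    estimates[min_types] = (annotated, annotated + est_only_regions)

for min_types, (annotated, with_est) in estimates.items():
    print(f">= {min_types} evidence type(s): {annotated:,} genes, "
          f"{with_est:,} including EST-only regions")
```

The three totals (39,114; 26,383; 22,711) and their EST-augmented counterparts (40,124; 27,393; 23,721) correspond to the rounded figures of about 40,000, 27,000, and 24,000 given above.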

3.4 Features of human gene 
transcripts 

We estimate the average span for a "typical"
gene in the human DNA sequence to
be about 27,894 bases. This is based on the 
average span covered by RefSeq tran- 
scripts, used because it represents our high- 
est confidence set. 

The set of transcripts promoted to gene
annotations varies in a number of ways. As 
can be seen from Table 8 and Fig. 9, tran- 
scripts predicted by Otto tend to be longer, 
having on average about 7.8 exons, whereas 
those promoted from gene-prediction pro- 
grams average about 3.7 exons. The largest 
number of exons that we have identified in a 
transcript is 234 in the titin mRNA. Table 8 
compares the amounts of evidence that support
the Otto and other predicted transcripts.
For example, one can see that a typical Otto
transcript has 6.99 of its 7.81 exons supported 
by protein homology evidence. As would be 
expected, the Otto transcripts generally have 
more support than do transcripts predicted by 
the de novo methods. 

4 Genome Structure 

Summary. This section describes several of 
the noncoding attributes of the assembled 
genome sequence and their correlations with 
the predicted gene set. These include an anal- 
ysis of G+C content and gene density in the 
context of cytogenetic maps of the genome, 
an enumerative analysis of CpG islands, and 
a brief description of the genome-wide repetitive
elements.

4.1 Cytogenetic maps

Perhaps the most obvious, and certainly the 
most visible, element of the structure of 
the genome is the banding pattern produced 
by Giemsa stain. Chromosomal banding 
studies have revealed that about 17% to
20% of the human chromosome comple- 
ment consists of C-bands, or constitutive 
heterochromatin (64). Much of this hetero- 
chromatin is highly polymorphic and con- 
sists of different families of alpha satellite 
DNAs with various higher order repeat 
structures (65). Many chromosomes have
complex inter- and intrachromosomal du- 
plications present in pericentromeric re- 
gions (66). About 5% of the sequence reads 
were identified as alpha satellite sequences; 
these were not included in the assembly.

Fig. 8. Analysis of split genes resulting from different annotation methods. A set of 4512
Sim4-based alignments of RefSeq transcripts to the genomic assembly were chosen (see the text
for criteria), and the numbers of overlapping Genscan, Otto (RefSeq only) annotations based solely
on Sim4-polished RefSeq alignments, and Otto (homology) annotations (annotations produced by
supplying all available evidence to Genscan) were tallied. These data show the degree to which
multiple Genscan predictions and/or Otto annotations were associated with a single RefSeq 
transcript. The zero class for the Otto-homology predictions shown here indicates that the 
Otto-homology calls were made without recourse to the RefSeq transcript, and thus no Otto call 
was made because of insufficient evidence. 



Table 8. Numbers of exons and transcripts supported by various types of evidence for Otto and de novo gene prediction methods. Highlighted cells indicate
the gene sets analyzed in this paper (boldface, set of genes selected for protein analysis; italic, total set of accepted de novo predictions).

                                                                             Types of evidence                    No. of lines of evidence*
                                                         Total     Mouse    Rodent   Protein     Human        ≥1        ≥2        ≥3        ≥4
Otto                         Number of transcripts      17,969    17,065    14,881    15,477    16,374    17,968†   17,501    15,877    12,451
                             Number of exons           141,218   111,174    89,569   108,431   118,869   140,710   127,955    99,574    59,804
De novo                      Number of transcripts      58,032    14,463     5,094     8,043     9,220    21,350     8,619     4,947     1,904
                             Number of exons           319,935    48,594    19,344    26,264    40,104    79,148    31,130    17,508     6,520
No. of exons per transcript  Otto                         7.84      5.77      6.01      6.99      7.24      7.81      7.19      6.00      4.28
                             De novo                      5.53      3.17      3.80      3.27      4.36       3.7      3.56      3.42      3.16

†Number includes alternative splice forms of the 17,764 genes mentioned elsewhere in the text.

Examination of pericentromeric regions is
ongoing. 

The remaining ~80% of the genome, the
euchromatic component, is divisible into G-, 
R-, and T-bands (67). These cytogenetic bands 
have been presumed to differ in their nucleotide 
composition and gene density, although we 
have been unable to determine precise band 
boundaries at the molecular level. T-bands are 
the most G+C- and gene-rich, and G-bands are 
G+C-poor (68). Bernardi has also offered a
description of the euchromatin at the molecular
level as long stretches of DNA of differing base
composition, termed isochores (denoted L, H1,
H2, and H3), which are >300 kbp in length
(69). Bernardi defined the L (light) isochores as
G+C-poor (<43%), whereas the H (heavy) 
isochores fall into three G+C-rich classes rep- 
resenting 24, 8, and 5% of the genome. Gene 
concentration has been claimed to be very low 
in the L isochores and 20-fold more enriched in 
the H2 and H3 isochores (70). By examining 
contiguous 50-kbp windows of G+C content
across the assembly, we found that regions of 
G+C content >48% (H3 isochores) averaged 
273.9 kbp in length, those with G+C content
between 43 and 48% (H1+H2 isochores) averaged
202.8 kbp in length, and the average span
of regions with <43% (L isochores) was 
1078.6 kbp. The correlation between G+C 
content and gene density was also examined in 
50-kbp windows along the assembled sequence 
(Table 9 and Figs. 10 and 11). We found that 
the density of genes was greater in regions of 
high G+C than in regions of low G+C content, 
as expected. However, the correlation between 
G+C content and gene density was not as 
skewed as previously predicted (59). A higher 
proportion of genes were located in the G+C-poor
regions than had been expected.
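
The window-based classification described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the G+C thresholds (>48%, 43 to 48%, <43%) are those given in the text, but the handling of windows without unambiguous bases, the assignment of boundary values to the middle class, and all function names are choices made for this example. The toy call uses a 20-bp window instead of 50 kbp so that it can be followed by hand.

```python
def gc_fraction(seq):
    """Fraction of G+C among unambiguous bases, or None if there are none."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return None
    return (seq.count("G") + seq.count("C")) / counted


def classify(gc):
    """Map a G+C fraction to the isochore classes used in the text."""
    if gc is None:
        return "N"  # window of undetermined bases (assumed label)
    percent = 100 * gc
    if percent > 48:
        return "H3"
    if percent >= 43:
        return "H1+H2"
    return "L"


def isochore_runs(seq, window=50_000):
    """Merge consecutive same-class windows into [label, length_bp] runs."""
    runs = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        label = classify(gc_fraction(chunk))
        if runs and runs[-1][0] == label:
            runs[-1][1] += len(chunk)
        else:
            runs.append([label, len(chunk)])
    return runs


def mean_run_length(runs):
    """Average run length in bp for each class."""
    lengths = {}
    for label, length in runs:
        lengths.setdefault(label, []).append(length)
    return {label: sum(v) / len(v) for label, v in lengths.items()}


# Toy sequence: two G+C-only windows, one A+T-only window, one 45% G+C window.
toy = "GC" * 20 + "AT" * 10 + "GCGCGCGCG" + "ATATATATATA"
runs = isochore_runs(toy, window=20)
print(runs)
print(mean_run_length(runs))
```

Averaging run lengths per class, as in `mean_run_length`, is one plausible way to obtain figures comparable to the average spans quoted above.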

Chromosomes 17, 19, and 22, which have 
a disproportionate number of H3-containing
bands, had the highest gene density (Table 
10). Conversely, the chromosomes that we
found to have the lowest gene density, X, 4,
18, 13, and Y, also have the fewest H3 bands. 
Chromosome 15, which also has few H3 
bands, did not have a particularly low gene 
density in our analysis. In addition, chromo- 
some 8, which we found to have a low gene 
density, does not appear to be unusual in its 
H3 banding. 

How valid is Ohno's postulate (71) that 
mammalian genomes consist of oases of genes 
in otherwise essentially empty deserts? It ap- 
pears that the human genome does indeed con- 
tain deserts, or large, gene-poor regions. If we 
define a desert as a region >500 kbp without a 
gene, then we see that 605 Mbp, or about 20% 
of the genome, is in deserts. These are not 
uniformly distributed over the various chromo- 
somes. Gene-rich chromosomes 17, 19, and 22 
have only about 12% of their collective 171 
Mbp in deserts, whereas gene-poor chromosomes
4, 13, 18, and X have 27.5% of their 492
Mbp in deserts (Table 11). The apparent lack of
predicted genes in these regions does not nec- 
essarily imply that they are devoid of biological 
function. 
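
The desert definition used above (a region of more than 500 kbp without a gene) can be expressed as a simple interval scan. The following Python sketch assumes genes are given as half-open (start, end) coordinates on one chromosome and that gene-free stretches at the chromosome ends also count; both points, and the toy coordinates, are assumptions made for illustration.

```python
def find_deserts(genes, chrom_length, min_size=500_000):
    """Return gene-free (start, end) intervals longer than min_size bp."""
    deserts = []
    cursor = 0  # end of the right-most gene seen so far
    for start, end in sorted(genes):
        if start - cursor > min_size:
            deserts.append((cursor, start))
        cursor = max(cursor, end)
    if chrom_length - cursor > min_size:
        deserts.append((cursor, chrom_length))
    return deserts


# Toy chromosome of 2 Mbp with three genes (made-up coordinates).
genes = [(100_000, 150_000), (900_000, 950_000), (1_000_000, 1_200_000)]
deserts = find_deserts(genes, chrom_length=2_000_000)
desert_bp = sum(end - start for start, end in deserts)
print(deserts, desert_bp, desert_bp / 2_000_000)
```

Summing the desert lengths over all chromosomes and dividing by the genome size gives the kind of fraction reported in the text.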

4.2 Linkage map 

Linkage maps provide the basis for genetic 
analysis and are widely used in the study of the 
inheritance of traits and in the positional clon- 
ing of genes. The distance metric, centimorgans 
(cM), is based on the recombination rate between
homologous chromosomes during meiosis.
In general, the rate of recombination in
females is greater than that in males, and this
degree of map expansion is not uniform across
the genome (72). One of the opportunities enabled
by a nearly complete genome sequence is
to produce the ultimate physical map, and to
fully analyze its correspondence with two other
maps that have been widely used in genome
and genetic analysis: the linkage map and the
cytogenetic map. This would close the loop
between the mapping and sequencing phases of
the genome project.

Table 9. Characteristics of G+C in isochores.

                       Fraction of genome (%)    Fraction of genes (%)
Isochore   G+C (%)     Predicted*   Observed     Predicted*   Observed
H3         >48          5            9.5         37           24.8
H1/H2      43-48       25           21.2         32           26.6
L          <43         67           69.2         31           48.5

*The predictions were based on Bernardi's definitions (70) of the isochore structure of the human genome.

Fig. 9. Comparison of the number of exons per transcript between the 17,968 Otto transcripts and
21,350 de novo transcript predictions with at least one line of evidence that do not overlap with an
Otto prediction. Both sets have the highest number of transcripts in the two-exon category, but the
de novo gene predictions are skewed much more toward smaller transcripts. In the Otto set, 19.7% of
the transcripts have one or two exons, and 5.7% have more than 20. In the de novo set, 49.3% of the
transcripts have one or two exons, and 0.2% have more than 20.

We mapped the location of the markers
that constitute the Genethon linkage map to
the genome. The rate of recombination, expressed
as cM per Mbp, was calculated for
3-Mbp windows as shown in Table 12. Higher
rates of recombination in the telomeric
region of the chromosomes have been previously
documented (73). From this mapping
result, there is a difference of 4.99 between
lowest rates and highest rates and the largest
difference of 4.4 between males and females
(4.99 to 0.47 on chromosome 16). This indicates
that the variability in recombination
rates among regions of the genome exceeds
the differences in recombination rates between
males and females. The human genome
has recombination hotspots, where recombination
rates vary fivefold or more over
a space of 1 kbp, so the picture one gets of the
magnitude of variability in recombination
rate will depend on the size of the window
examined. Unfortunately, too few meiotic
crossovers have occurred in Centre d'Etude
du Polymorphism Humain (CEPH) and other
reference families to provide a resolution any
finer than about 3 Mbp. The next challenge
will be to determine a sequence basis of
recombination at the chromosomal level. An
accurate predictor for the rate for variation in
recombination rates between any pair of
markers would be extremely useful in designing
markers to narrow a region of linkage,
such as in positional cloning projects.
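
A windowed recombination rate of the kind described in section 4.2 (cM per Mbp in 3-Mbp windows) can be sketched as follows. The text does not say how genetic positions were assigned to window boundaries, so the linear interpolation between flanking markers used here is an assumption, as are the function names and the made-up marker positions.

```python
from bisect import bisect_right


def interpolate_cm(markers, pos):
    """Linearly interpolate the genetic position (cM) at physical position pos.

    markers is a list of (bp, cM) tuples sorted by physical position.
    """
    positions = [bp for bp, _ in markers]
    i = bisect_right(positions, pos)
    if i == 0:
        return markers[0][1]
    if i == len(markers):
        return markers[-1][1]
    (bp0, cm0), (bp1, cm1) = markers[i - 1], markers[i]
    return cm0 + (cm1 - cm0) * (pos - bp0) / (bp1 - bp0)


def windowed_rates(markers, window_bp=3_000_000):
    """Return (start, end, cM per Mbp) for consecutive full windows."""
    markers = sorted(markers)
    start, last = markers[0][0], markers[-1][0]
    rates = []
    while start + window_bp <= last:
        end = start + window_bp
        delta_cm = interpolate_cm(markers, end) - interpolate_cm(markers, start)
        rates.append((start, end, delta_cm / (window_bp / 1e6)))
        start = end
    return rates


# Toy linkage map (made-up values): physical position in bp, genetic position in cM.
markers = [(0, 0.0), (3_000_000, 3.0), (6_000_000, 4.5), (9_000_000, 10.5)]
rates = windowed_rates(markers)
for start, end, rate in rates:
    print(f"{start / 1e6:.0f}-{end / 1e6:.0f} Mbp: {rate:.2f} cM/Mbp")
```

Taking the maximum, mean, and minimum of such window rates per chromosome would give a summary with the same shape as Table 12.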

4.3 Correlation between CpG islands 
and genes 

CpG islands are stretches of unmethylated 
DNA with a higher frequency of CpG
dinucleotides when compared with the entire
genome (74). CpG islands are believed to
preferentially occur at the transcriptional start 
of genes, and it has been observed that most 
housekeeping genes have CpG islands at the 
5' end of the transcript (75, 76). In addition, 
experimental evidence indicates that CpG is- 
land methylation is correlated with gene in- 
activation (77) and has been shown to be 
important during gene imprinting (78) and 
tissue-specific gene expression (79).

Experimental methods have been used 
that resulted in an estimate of 30,000 to 
45,000 CpG islands in the human genome 
(74, 80) and an estimate of 499 CpG islands 
on human chromosome 22 (81). Larsen et
al. (76) and Gardiner-Garden and Frommer 
(75) used a computational method to iden- 
tify CpG islands and defined them as re- 
gions of DNA of >200 bp that have a G+C 
content of >50% and a ratio of observed
versus expected frequency of CG dinucleotide
>0.6.

It is difficult to make a direct compari- 
son of experimental definitions of CpG is- 
lands with computational definitions be- 
cause computational methods do not con- 
sider the methylation state of cytosine and 
experimental methods do not directly select 
regions of high G+C content. However, we 
can determine the correlation of CpG island 
with gene starts, given a set of annotated 
genomic transcripts and the whole genome 
sequence. We have analyzed the publicly 
available annotation of chromosome 22, as 
well as using the entire human genome in 
our assembly and the computationally an- 
notated genes. A variation of the CpG is- 
land computation was compared with 
Larsen et al. (76). The main differences are 
that we use a sliding window of 200 bp, 
consecutive windows are merged only if 
they overlap, and we recompute the CpG 
value upon merging, thus rejecting any po- 
tential island if it scores less than the 
threshold. 
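
The computational definition and the sliding-window variation described above can be sketched in Python. The observed/expected ratio below uses the commonly cited form (number of CG dinucleotides multiplied by the sequence length, divided by the product of the C and G counts); the one-base step between windows, the use of a 200-bp window as the minimum length, and the function names are assumptions of this sketch, not details confirmed by the text.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CG dinucleotide ratio)."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc = (c + g) / n if n else 0.0
    ratio = (cpg * n) / (c * g) if c and g else 0.0
    return gc, ratio


def is_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the length, G+C, and observed/expected thresholds."""
    gc, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc > min_gc and ratio > min_ratio


def find_cpg_islands(seq, window=200, min_ratio=0.6):
    """Slide a window, merge overlapping passing windows, then re-score."""
    merged = []
    current = None
    for start in range(0, len(seq) - window + 1):
        if not is_island(seq[start:start + window], window, 0.5, min_ratio):
            continue
        end = start + window
        if current is not None and start < current[1]:
            current[1] = end  # overlaps the previous passing window
        else:
            if current is not None:
                merged.append(current)
            current = [start, end]
    if current is not None:
        merged.append(current)
    # Recompute the statistics on each merged region and reject failures.
    return [(s, e) for s, e in merged
            if is_island(seq[s:e], window, 0.5, min_ratio)]


# Toy sequence: a 300-bp CG-rich block flanked by A+T-only sequence.
toy = "AT" * 150 + "CG" * 150 + "AT" * 150
islands = find_cpg_islands(toy)
print(islands)
```

Calling `find_cpg_islands(seq, min_ratio=0.8)` would correspond to the stricter threshold that the text calls method 2.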

To compute various CpG statistics, we 
used two different thresholds of CG dinucleotide
likelihood ratio. Besides using the original
threshold of 0.6 (method 1), we used a
higher threshold of CG dinucleotide likeli- 
hood ratio of 0.8 (method 2), which results in 
the number of CpG islands on chromosome 
22 close to the number of annotated genes on 
this chromosome. The main results are sum- 
marized in Table 13. CpG islands computed 
with method 1 predicted only 2.6% of the 
CSA sequence as CpG, but 40% of the gene 
starts (start codons) are contained inside a
CpG island. This is comparable to ratios reported
by others (82). The last two rows of
the table show the observed and expected
average distance, respectively, of the closest
CpG island from the first exon. The observed
average distances to the closest CpG island are
smaller than the corresponding expected distances,
confirming an association between CpG island
and the first exon.

Fig. 10. Relation between G+C content and gene density. The blue bars show the percent of the
genome (in 50-kbp windows) with the indicated G+C content. The percent of the total number of
genes associated with each G+C bin is represented by the yellow bars. The graph shows that about
5% of the genome has a G+C content of between 50 and 55%, but that this portion contains
nearly 15% of the genes.

We also looked at the distribution of CpG 
island nucleotides among various sequence 
classes such as intergenic regions, introns, 
exons, and first exons. We computed the 
likelihood score for each sequence class as 
the ratio of the observed fraction of CpG 
island nucleotides in that sequence class 
and the expected fraction of CpG island 
nucleotides in that sequence class. The result
of applying method 1 on CSA were
scores of 0.89 for intergenic region, 1.2 for
intron, 5.86 for exon, and 13.2 for first 
exon. The same trend was also found for 
chromosome 22 and after the application of 
a higher threshold (method 2) on both data 
sets. In sum, genome-wide analysis has 
extended earlier analysis and suggests a 
strong correlation between CpG islands and 
first coding exons. 
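
The likelihood score described above is a ratio of an observed to an expected fraction. The text does not spell out how the expected fraction was obtained; the Python sketch below assumes that it is simply each sequence class's share of all nucleotides, and that the classes are non-overlapping. The class sizes are made-up toy numbers, not values from the analysis.

```python
def likelihood_scores(class_lengths, island_bases_by_class):
    """Observed/expected share of CpG island nucleotides per sequence class.

    Assumption: the expected share of a class equals its share of all
    nucleotides, so a score above 1 indicates enrichment.
    """
    total_bases = sum(class_lengths.values())
    total_island = sum(island_bases_by_class.values())
    scores = {}
    for name, length in class_lengths.items():
        observed = island_bases_by_class.get(name, 0) / total_island
        expected = length / total_bases
        scores[name] = observed / expected
    return scores


# Toy numbers: total nucleotides per class and CpG island nucleotides per class.
class_lengths = {"intergenic": 700, "intron": 250, "exon": 50}
island_bases = {"intergenic": 35, "intron": 15, "exon": 50}
scores = likelihood_scores(class_lengths, island_bases)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Under this assumption, a class holding half of the island nucleotides but only 5% of the sequence scores about 10, which is the pattern the text reports for first exons.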

4.4 Genome-wide repetitive elements 

The proportion of the genome covered by 
various classes of repetitive DNA is present- 
ed in Table 14. We observed about 35% of 
the genome in these repeat classes, very sim- 
ilar to values reported previously (83). Repet- 
itive sequence may be underrepresented in 
the Celera assembly as a result of incomplete 
repeat resolution, as discussed above. About 
8% of the scaffold length is in gaps, and we 
expect that much of this is repetitive se- 
quence. Chromosome 19 has the highest re- 
peat density (57%), as well as the highest 
gene density (Table 10). Of interest, among 
the different classes of repeat elements, we 
observe a clear association of Alu elements 
and gene density, which was not observed 
between LINEs and gene density. 

5 Genome Evolution 

Summary. The dynamic nature of genome 
evolution can be captured at several levels. 
These include gene duplications mediated by 
RNA intermediates (retrotransposition) and
segmental genomic duplications. In this sec- 
tion, we document the genome-wide occur- 
rence of retrotransposition events generating 
functional (intronless paralogs) or inactive
genes (pseudogenes). Genes involved in
translational processes and nuclear regulation
account for nearly 50% of all intronless paralogs
and processed pseudogenes detected in
our survey. We have also cataloged the extent 
of segmental genomic duplication and pro- 
vide evidence for 1077 duplicated blocks 
covering 3522 distinct genes.

Fig. 11 (continued). Relation among gene density (orange), G+C content
(green), EST density (blue), and Alu density (pink) along the lengths of
each of the chromosomes. Gene density was calculated in 1-Mbp windows.
The percent of G+C nucleotides was calculated in 100-kbp windows.
The number of ESTs and Alu elements is shown per 100-kbp window.


5.1 Retrotransposition in the human 
genome 

Retrotransposition of processed mRNA 
transcripts into the genome results in func- 
tional genes, called intronless paralogs, or 
inactivated genes (pseudogenes). A paralog 
refers to a gene that appears in more than 
one copy in a given organism as a result of
a duplication event. The existence of both
intron-containing and intronless forms of 
genes encoding functionally similar or 
identical proteins has been previously de- 
scribed (84, 85). Cataloging these evolu- 
tionary events on the genomic landscape is 
of value in understanding the functional 
consequences of such gene-duplication
events in cellular biology. Identification of
conserved intronless paralogs in the mouse 
or other mammalian genomes should pro- 
vide the basis for capturing the evolution- 
ary chronology of these transposition 
events and provide insights into gene loss 
and accretion in the mammalian radiation. 
A set of proteins corresponding to all 901
Otto-predicted, single-exon genes were subjected
to BLAST analysis against the proteins
encoded by the remaining multiexon predict- 
ed transcripts. Using homology criteria of 
70% sequence identity over 90% of the 
length, we identified 298 instances of single- 
to multi-exon correspondence. Of these 298 
sequences, 97 were represented in the Gen- 
Bank data set of experimentally validated 
full-length genes at the stringency specified 
and were verified by manual inspection. 
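
The homology criteria in this paragraph (70% sequence identity over 90% of the length) amount to a filter on alignment results. The Python sketch below assumes that each hit records percent identity, aligned length, and the length of the single-exon query protein, and that coverage is measured against the query; the `Hit` record, its field names, and the example identifiers are hypothetical and do not correspond to a specific BLAST output format.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str           # single-exon protein
    subject: str         # protein from a multiexon transcript
    identity: float      # percent identity over the aligned region
    aligned_length: int  # residues of the query covered by the alignment
    query_length: int


def passes(hit, min_identity=70.0, min_coverage=0.90):
    """Apply the identity and length-coverage thresholds from the text."""
    coverage = hit.aligned_length / hit.query_length
    return hit.identity >= min_identity and coverage >= min_coverage


def candidate_paralogs(hits):
    """Keep the highest-identity passing hit for each single-exon query."""
    best = {}
    for hit in hits:
        if not passes(hit):
            continue
        previous = best.get(hit.query)
        if previous is None or hit.identity > previous.identity:
            best[hit.query] = hit
    return best


hits = [
    Hit("single_exon_A", "multi_exon_1", identity=85.0, aligned_length=285, query_length=300),
    Hit("single_exon_A", "multi_exon_2", identity=72.0, aligned_length=290, query_length=300),
    Hit("single_exon_B", "multi_exon_3", identity=95.0, aligned_length=200, query_length=300),
    Hit("single_exon_C", "multi_exon_4", identity=65.0, aligned_length=300, query_length=300),
]
matches = candidate_paralogs(hits)
print({query: hit.subject for query, hit in matches.items()})
```

Only the first query survives in this toy input: the second fails the coverage threshold and the third fails the identity threshold.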

We believe that these 97 cases may rep- 
resent intronless paralogs (see Web table 1 on 
Science Online at www.sciencemag.org/cgi/ 
content/full/291/5507/1304/DC1) of known
genes. Most of these are flanked by direct 
repeat sequences, although the precise nature 
of these repeats remains to be determined. All 
of the cases for which we have high confidence
contain polyadenylated [poly(A)] tails
characteristic of retrotransposition. 

Recent publications describing the phe- 
nomenon of functional intronless paralogs 
speculate that retrotransposition may serve as 
a mechanism used to escape X-chromosomal 
inactivation (84, 86). We do not find a bias 
toward X chromosome origination of these 
retrotransposed genes; rather, the results 
show a random chromosome distribution of 
both the intron-containing and corresponding 
intronless paralogs. We also have found sev- 
eral cases of retrotransposition from a single 
source chromosome to multiple target chro- 
mosomes. Interesting examples include the 
retrotransposition of a five-exon-containing
ribosomal protein L21 gene on chromosome 
13 onto chromosomes 1, 3, 4, 7, 10, and 14, 
respectively. The size of the source genes can 
also show variability. The largest example is 
the 31-exon diacylglycerol kinase zeta gene 
on chromosome 11 that has an intronless 
paralog on chromosome 13. Regardless of 
route, retrotransposition with subsequent 
gene changes in coding or noncoding regions 
that lead to different functions or expression 
patterns, represents a key route to providing 
an enhanced functional repertoire in mam- 
mals (87). 

Our preliminary set of retrotransposed intronless
paralogs contains a clear overrepresentation
of genes involved in translational
processes (40% ribosomal proteins and 10% 
translation elongation factors) and nuclear 
regulation (HMG nonhistone proteins, 4%), 
as well as metabolic and regulatory enzymes. 
EST matches specific to a subset of intronless 
paralogs suggest expression of these intron- 
less paralogs. Differences in the upstream 
regulatory sequences between the source 
genes and their intronless paralogs could ac- 
count for differences in tissue-specific gene 
expression. Defining which, if any, of these 
processed genes are functionally expressed 
and translated will require further elucidation 
and experimental validation. 



5.2 Pseudogenes 

A pseudogene is a nonfunctional copy that is
very similar to a normal gene but that has
been altered slightly so that it is not expressed.
We developed a method for the preliminary
analysis of processed pseudogenes
in the human genome as a starting point in
elucidating the ongoing evolutionary forces
that account for gene inactivation. The general
structural characteristics of these processed
pseudogenes include the complete
lack of intervening sequences found in the
functional counterparts, a poly(A) tract at the
3' end, and direct repeats flanking the pseudogene
sequence. Processed pseudogenes occur
as a result of retrotransposition, whereas
unprocessed pseudogenes arise from segmental
genome duplication.

Table 11. Genome overview.

Size of the genome (including gaps)                                 2.91 Gbp
Size of the genome (excluding gaps)                                 2.66 Gbp
Longest contig                                                      1.99 Mbp
Longest scaffold                                                    14.4 Mbp
Percent of A+T in the genome                                        54
Percent of G+C in the genome                                        38
Percent of undetermined bases in the genome                         9
Most GC-rich 50 kb                                                  Chr. 2 (66%)
Least GC-rich 50 kb                                                 Chr. X (25%)
Percent of genome classified as repeats                             35
Number of annotated genes                                           26,383
Percent of annotated genes with unknown function                    42
Number of genes (hypothetical and annotated)                        39,114
Percent of hypothetical and annotated genes with unknown function   59
Gene with the most exons                                            Titin (234 exons)
Average gene size                                                   27 kbp
Most gene-rich chromosome                                           Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes                                         Chr. 13 (5 genes/Mb), Chr. Y (5 genes/Mb)
Total size of gene deserts (>500 kb with no annotated genes)        605 Mbp
Percent of base pairs spanned by genes                              25.5 to 37.8*
Percent of base pairs spanned by exons                              1.1 to 1.4*
Percent of base pairs spanned by introns                            24.4 to 36.4*
Percent of base pairs in intergenic DNA                             74.5 to 63.6*
Chromosome with highest proportion of DNA in annotated exons        Chr. 19 (9.33)
Chromosome with lowest proportion of DNA in annotated exons         Chr. Y (0.36)
Longest intergenic region (between annotated + hypothetical genes)  Chr. 13 (3,038,416 bp)
Rate of SNP variation                                               1/1250 bp

*In these ranges, the percentages correspond to the annotated gene set (26,383 genes) and the hypothetical +
annotated gene set (39,114 genes), respectively.

Table 12. Rate of recombination per physical distance (cM/Mb) across the genome. Genethon markers
were placed on CSA-mapped assemblies, and then relative physical distances and rates were calculated
in 3-Mb windows for each chromosome. NA, not applicable.

                  Male              Sex-average              Female
Chrom.     Max.   Avg.   Min.    Max.   Avg.   Min.    Max.   Avg.   Min.
1          2.60   1.12   0.23    2.81   1.42   0.52    3.39   1.76   0.68
2          2.23   0.78   0.33    2.65   1.12   0.54    3.17   1.40   0.61
3          2.55   0.86   0.23    2.40   1.07   0.42    2.71   1.30   0.33
4          1.66   0.67   0.15    2.06   1.04   0.60    2.50   1.40   0.77
5          2.00   0.67   0.18    1.87   1.08   0.42    2.26   1.43   0.62
6          1.97   0.71   0.28    2.57   1.12   0.37    3.47   1.67   0.64
7          2.34   1.16   0.48    1.67   1.17   0.47    2.27   1.21   0.34
8          1.83   0.73   0.14    2.40   1.05   0.46    3.44   1.36   0.43
9          2.01   0.99   0.53    1.95   1.32   0.77    2.63   1.66   0.82
10         3.73   1.03   0.22    3.05   1.29   0.66    2.84   1.51   0.76
11         1.43   0.72   0.31    2.13   0.99   0.47    3.10   1.32   0.49
12         4.12   0.76   0.26    3.35   1.16   0.49    2.93   1.55   0.59
13         1.60   0.75   0.01    1.87   0.95   0.17    2.49   1.19   0.32
14         3.15   0.98   0.18    2.65   1.30   0.62    3.14   1.63   0.75
15         2.28   0.94   0.34    2.31   1.22   0.42    2.53   1.56   0.54
16         1.83   1.00   0.47    2.70   1.55   0.63    4.99   2.32   1.12
17         3.87   0.87   0.00    3.54   1.35   0.54    4.19   1.83   0.94
18         3.12   1.37   0.86    3.75   1.66   0.43    4.35   2.24   0.72
19         3.02   0.97   0.10    2.57   1.41   0.49    2.89   1.75   0.87
20         3.64   0.89   0.00    2.79   1.50   0.83    3.31   2.15   1.34
21         3.23   1.26   0.69    2.37   1.62   1.08    2.58   1.90   1.18
22         1.25   1.10   0.84    1.88   1.41   1.08    3.73   2.08   0.93
X            NA     NA     NA      NA     NA     NA    3.12   1.64   0.72
Y            NA     NA     NA      NA     NA     NA      NA     NA     NA
Genome     4.12   0.88   0.00    3.75   1.22   0.17    4.99   1.55   0.32

We searched the complete set of Otto- 
predicted transcripts against the genomic se- 
quence by means of BLAST. Genomic re- 
gions corresponding to all Otto-predicted 
transcripts were excluded from this analysis. 
We identified 2909 regions matching with 
greater than 70% identity over at least 70% of 
the length of the transcripts that likely repre- 
sent processed pseudogenes. This number is 
probably an underestimate because specific 
methods to search for pseudogenes were not 
used. 
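
The thresholds in the preceding paragraph can be sketched as a simple filter over alignment hits. This is only an illustration: the `Hit` record, its field names, and the `is_candidate_pseudogene` helper are hypothetical and are not taken from the original analysis, and coverage is assumed to mean aligned transcript bases divided by transcript length.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one transcript-to-genome alignment."""
    transcript_id: str
    transcript_length: int      # length of the query transcript (bp)
    aligned_length: int         # transcript bases covered by the alignment
    percent_identity: float     # identity over the aligned region, 0 to 100
    overlaps_source_gene: bool  # True if the hit lies in the transcript's own locus


def is_candidate_pseudogene(hit, min_identity=70.0, min_coverage=0.70):
    """Apply the >70% identity and >=70% length criteria described in the text."""
    if hit.overlaps_source_gene:
        # Regions corresponding to the predicted transcripts themselves are excluded.
        return False
    coverage = hit.aligned_length / hit.transcript_length
    return hit.percent_identity > min_identity and coverage >= min_coverage
```

A hit covering 800 of 1000 transcript bases at 85% identity would pass this filter, whereas one covering only 600 bases would not.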

We looked for correlations between 
structural elements and the propensity for 
retrotransposition in the human genome. 
GC content and transcript length were compared
between the genes with processed
pseudogenes (1177 source genes) versus
the remainder of the predicted gene set. 
Transcripts that give rise to processed pseu- 
dogenes have shorter average transcript 
length (1027 bp versus 1594 bp for the Otto 
set) as compared with genes for which no 
pseudogene was detected. The overall GC 
content did not show any significant differ- 
ence, contrary to a recent report (88). There 
is a clear trend in gene families that are 
present as processed pseudogenes. These 
include ribosomal proteins (67%), lamin
receptors (10%), translation elongation factor
alpha (5%), and HMG-non-histone proteins
(2%). The increased occurrence of
retrotransposition (both intronless paralogs 
and processed pseudogenes) among genes 
involved in translation and nuclear regula- 
tion may reflect an increased transcription- 
al activity of these genes. 

5.3 Gene duplication in the human 
genome 

Building on a previously published procedure 
(27), we developed a graph-theoretic algo- 
rithm, called Lek, for grouping the predicted 
human protein set into protein families (89). 



Table 13. Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length) and the
whole genome (2.9-Gbp sequence length) by means of two different methods. Method 1 uses a CG
likelihood ratio of >0.6. Method 2 uses a CG likelihood ratio of >0.8.

                                                        Chromosome 22           Whole genome (CS assembly)
                                                        Method 1    Method 2    Method 1    Method 2
Number of CpG islands detected                          5,211       522         195,706     26,876
Average length of island (bp)                           390         535         395         497
Percent of sequence predicted as CpG                    5.9         0.8         2.6         0.4
Percent of first exons that overlap a CpG island        44          25          42          22
Percent of first exons with first position of exon      37          22          40          21
  contained inside a CpG island
Average distance between first exon and closest CpG     1,013       10,486      2,182       17,021
  island (bp)
Expected distance between first exon and closest CpG    3,262       32,567      7,164       55,811
  island (bp)

Table 14. Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence.

Repetitive elements                            Megabases in   Percent of   Previously
                                               assembled      assembly     predicted (%)
                                               sequences
Alu                                            288            9.9          10.0
Mammalian interspersed repeat (MIR)            66             2.3          1.7
Medium reiteration (MER)                       50             1.7          1.6
Long terminal repeat (LTR)                     155            5.3          5.6
Long interspersed nucleotide element (LINE)    466            16.1         16.7
Total                                          1025           35.3         35.6



The complete clusters that result from the 
Lek clustering provide one basis for compar- 
ing the role of whole-genome or chromosom- 
al duplication in protein family expansion as 
opposed to other means, such as tandem du- 
plication. Because each complete cluster rep- 
resents a closed and certain island of homol- 
ogy, and because Lek is capable of simulta- 
neously clustering protein complements of 
several organisms, the number of proteins 
contributed by each organism to a complete 
cluster can be predicted with confidence de- 
pending on the quality of the annotation of 
each genome. The variance of each organ- 
ism's contribution to each cluster can then be 
calculated, allowing an assessment of the rel- 
ative importance of large-scale duplication 
versus smaller-scale, organism-specific ex- 
pansion and contraction of protein families, 
presumably as a result of natural selection 
operating on individual protein families with- 
in an organism. As can be seen in Fig. 12, the 
large variance in the relative numbers of hu- 
man as compared with D. melanogaster and 
Caenorhabditis elegans proteins in complete 
clusters may be explained by multiple events 
of relative expansions in gene families in 
each of the three animal genomes. Such ex- 
pansions would give rise to the distribution 
that shows a peak at 1:1 in the ratio for 
human-worm or human-fly clusters with the 
slope spread covering both human and fly/ 
worm predominance, as we observed (Fig. 
12). Furthermore, there are nearly as many 
clusters where worm and fly proteins pre- 
dominate despite the larger numbers of pro- 
teins in the human. At face value, this anal- 
ysis suggests that natural selection acting on 
individual protein families has been a major 
force driving the expansion of at least some 
elements of the human protein set. However,
in our analysis, the difference between an 
ancient whole-genome duplication followed 
by loss, versus piecemeal duplication, cannot 
be easily distinguished. In order to differen- 
tiate these scenarios, more extended analyses 
were performed.

5.4 Large-scale duplications 

Using two independent methods, we 
searched for large-scale duplications in the 
human genome. First, we describe a protein 
family-based method that identified highly 
conserved blocks of duplication. We then 
describe our comprehensive method for identi- 
fying all interchromosomal block duplications. 
The latter method identified a large number of 
duplicated chromosomal segments covering 
parts of all 24 chromosomes. 
The first of the methods is based on the
idea of searching for blocks of highly con- 
served homologous proteins that occur in 
more than one location on the genome. For 
this comparison, two genes were considered 
equivalent if their protein products were determined
to be in the same family and the
same complete Lek cluster (essentially 
paralogous genes) (89). Initially, each chro- 
mosome was represented as a string of genes 
ordered by the start codons for predicted 
genes along the chromosome. We considered 
the two strands as a single string, because 
local inversions are relatively common events 
relative to large-scale duplications. Each 
gene was indexed according to the protein 
family and Lek complete cluster (89). All 
pairs of indexed gene strings were then 
aligned in both the forward and reverse directions
with the Smith-Waterman algorithm
(90). A match between two proteins of the 
same Lek complete cluster was given a score 
of 10 and a mismatch -10, with gap open
and extend penalties of -4 and -1. With
these parameters, 19 conserved interchromo- 
somal blocks of duplication were observed, 
all of which were also detected and expanded 
by the comprehensive method described be- 
low. The detection of only a relatively small 
number of block duplications was a consequence
of using an intrinsically conservative
method grounded in the conservative con- 
straints of the complete Lek clusters. 
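
The family-based comparison described above can be illustrated with a small local-alignment sketch in which each chromosome is a list of cluster identifiers rather than residues. This is an assumption-laden sketch, not the authors' implementation: the function names are hypothetical, and the gap model assumes that the first gapped position costs the gap-open penalty (-4) and each further position costs the gap-extend penalty (-1).

```python
def local_align_score(a, b, match=10, mismatch=-10, gap_open=-4, gap_extend=-1):
    """Best Smith-Waterman local alignment score with affine gaps (Gotoh recurrences).

    a and b are sequences of cluster identifiers (any hashable values).
    """
    n, m = len(a), len(b)
    neg_inf = float("-inf")
    H = [[0] * (m + 1) for _ in range(n + 1)]        # best score ending at (i, j)
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in a
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in b
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def best_block_score(a, b, **scoring):
    """Align in both forward and reverse directions and keep the better score."""
    return max(local_align_score(a, b, **scoring),
               local_align_score(a, list(reversed(b)), **scoring))
```

For example, two gene strings sharing four clusters in the same order score 40, and an unrelated gene interrupting one of them lowers the score to 36 under these assumptions.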

In the second, more comprehensive ap- 
proach, we aligned all chromosomes directly 
with one another using an algorithm based on 
the MUMmer system (91). This alignment 
method uses a suffix tree data structure and a 
linear-time algorithm to align long sequences 
very rapidly; for example, two chromosomes 
of 100 Mbp can be aligned in less than 20 
min (on a Compaq Alpha computer) with 4 
gigabytes of memory. This procedure was
used recently to identify numerous large-scale
segmental duplications among the five
chromosomes of A. thaliana (92); in that 
organism, the method revealed that 60% of 
the genome (66 Mbp) is covered by 24 very 
large duplicated segments. For Arabidopsis, a 
DNA-based alignment was sufficient to re- 
veal the segmental duplications between 
chromosomes; in the human genome, DNA 
alignments at the whole-chromosome level 
are insufficiently sensitive. Therefore, a mod- 
ified procedure was developed and applied, 
as follows. First, all 26,588 proteins 
(9,675,713 amino acids) were concatenated
end-to-end in order as they occur
along each of the 24 chromosomes, irrespec- 
tive of strand location. The concatenated pro- 
tein set was then aligned against each chro- 
mosome by the MUMmer algorithm. The 
resulting matches were clustered to extract all 
sets of three or more protein matches that 
occur in close proximity on two different 
chromosomes (93); these represent the can- 
didate segmental duplications. A series of 
filters were developed and applied to remove 
likely false-positives from this set; for exam- 
ple, small blocks that were spread across 
many proteins were removed. To refine the
filtering methods, a shuffled protein set was
first created by taking the 26,588 proteins, 
randomizing their order, and then partitioning
them into 24 shuffled chromosomes, each 
containing the same number of proteins as the 
true genome. This shuffled protein set has the 
identical composition to the real genome; in 
particular, every protein and every domain
appears the same number of times. The com- 
plete algorithm was then applied to both the 
real and the shuffled data, with the results on 
the shuffled data being used to estimate the 
false-positive rate. The algorithm after filtering
yielded 10,310 gene pairs in 1077 duplicated
blocks containing 3522 distinct genes;
tandemly duplicated expansions in many of 
the blocks explain the excess of gene pairs to 
distinct genes. In the shuffled data, by contrast,
only 370 gene pairs were found, giving
a false-positive estimate of 3.6%. The most 
likely explanation for the 1077 block dupli- 
cations is ancient segmental duplications. In 
many cases, the order of the proteins has been 
shuffled, although proximity is preserved. 
Out of the 1077 blocks, 159 contain only 
three genes, 137 contain four genes, and 781 
contain five or more genes. 
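
The shuffled control described above can be sketched as follows. The function names and the seeded random generator are illustrative assumptions; only the idea (randomize protein order, keep the per-chromosome counts, then compare pair counts between shuffled and real data) comes from the text.

```python
import random


def shuffle_genome(chromosomes, seed=0):
    """Return shuffled chromosomes with the same protein counts as the input.

    chromosomes is a list of lists of protein identifiers, one list per chromosome.
    """
    rng = random.Random(seed)
    pool = [protein for chrom in chromosomes for protein in chrom]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for chrom in chromosomes:
        shuffled.append(pool[start:start + len(chrom)])
        start += len(chrom)
    return shuffled


def false_positive_rate(pairs_in_shuffled, pairs_in_real):
    """Fraction of reported gene pairs expected to arise by chance."""
    return pairs_in_shuffled / pairs_in_real


# Counts quoted in the text: 370 pairs in shuffled data, 10,310 in real data.
rate = false_positive_rate(370, 10310)
```

With the counts quoted in the text this gives about 0.036, that is, the 3.6% estimate.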

To illustrate the extent of the detected 
duplications, Fig. 13 shows all 1077 block 
duplications indexed to each chromosome in
24 panels in which only duplications mapped 
to the indexed chromosome are displayed. 
The figure makes it clear that the duplications 
are ubiquitous in the genome. One feature 
that it displays is many relatively small chro- 
mosomal stretches, with one-to-many dupli- 
cation relationships that are graphically strik- 
ing. One such example captured by the anal- 
ysis is the well-documented olfactory recep- 
tor (OR) family, which is scattered in blocks 
throughout the genome and which has been 
analyzed for genome-deployment reconstructions
at several evolutionary stages (94). The
figure also illustrates that some chromo- 
somes, such as chromosome 2, contain many 
more detected large-scale duplications than 
others. Indeed, one of the largest duplicated 
segments is a large block of 33 proteins on 
chromosome 2, spread among eight smaller 
blocks in 2p, that aligns to a paralogous set on 
chromosome 14, with one rearrangement (see 
chromosomes 2 and 14 panels in Fig. 13). 
The proteins are not contiguous but span a 
region containing 97 proteins on chromo- 
some 2 and 332 proteins on chromosome 14. 
The likelihood of observing this many dupli- 
cated proteins by chance, even over a span of 
this length, is 2.3 × 10^-68 (93). This duplicated
set spans 20 Mbp on chromosome 2 and
63 Mbp on chromosome 14, over 70% of the 
latter chromosome. Chromosome 2 also con- 
tains a block duplication that is nearly as 
large, which is shared by chromosome arm 2q 
and chromosome 12. This duplication incor- 
porates two of the four known Hox gene 
clusters, but considerably expands the extent 
of the duplications proximally and distally on 
the pair of chromosome arms. This breadth of 
duplication is also seen on the two chromo- 
somes carrying the other two Hox clusters. 

An additional large duplication, between 
chromosomes 18 and 20, serves as a good 
example to illustrate some of the features 
common to many of the other observed large 
duplications (Fig. 13, inset). This duplication 
contains 64 detected ordered intrachromo- 
somal pairs of homologous genes. After dis- 
counting a 40-Mb stretch of chromosome 18 
free of matches to chromosome 20, which is 
likely to represent a large insert (between the 
gene assignments "Krup rel" and "collagen
rel" on chromosome 18 in Fig. 13), the full 
duplication segment covers 36 Mb on chro- 
mosome 18 and 28 Mb on chromosome 20. 




Fig. 12. Gene duplication in complete protein clusters. The predicted protein sets of human, worm,
and fly were subjected to Lek clustering (27). The numbers of clusters with varying ratios (whole
number) of human versus worm and human versus fly proteins per cluster were plotted.
[Axis labels: human predominant; fly/worm predominant.]
By this measure, the duplication segment
spans nearly half of each chromosome's net 
length. The most likely scenario is that the 
whole span of this region was duplicated as a 
single very large block, followed by shuffling 
owing to smaller scale rearrangements. As 
such, at least four subsequent rearrangements 
would need to be invoked to explain the 
relative insertions and inversions seen in the 
duplicated segment interval. The 64 protein 
pairs in this alignment occur among 217 pro- 
tein assignments on chromosome 18, and 
among 322 protein assignments on chromosome
20, for a density of involved proteins of
20 to 30%. This is consistent with an ancient 
large-scale duplication followed by subse- 
quent gene loss on one or both chromosomes. 
Loss of just one member of a gene pair 
subsequent to the duplication would result in 
a failure to score a gene pair in the block; less 
than 50% gene loss on the chromosomes 
would lead to the duplication density ob- 
served here. As an independent verification 
of the significance of the alignments detect- 
ed, it can be seen that a substantial number of 
the pairs of aligning proteins in this duplication,
including some of those annotated (Fig.
13), are those populating small Lek complete 
clusters (see above). This indicates that they 
are members of very small families of para- 
logs; their relative scarcity within the genome 
validates the uniqueness and robust nature of 
their alignments. 

Two additional qualitative features were observed
among many of the large-scale duplications.
First, several proteins with disease associations,
with OMIM (Online Mendelian Inheritance
in Man) assignments, are members of
duplicated segments (see web table 2 on Science
Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1).
We have also
observed a few instances where paralogs on 
both duplicated segments are associated with 
similar disease conditions. Notable among 
these genes are proteins involved in hemostasis 
(coagulation factors) that are associated with 
bleeding disorders, transcriptional regulators 
like the homeobox proteins associated with de- 
velopmental disorders, and potassium channels 
associated with cardiovascular conduction ab- 
normalities. For each of these disease genes, 
closer study of the paralogous genes in the 
duplicated segment may reveal new insights 
into disease causation, with further investiga- 
tion needed to determine whether they might be 
involved in the same or similar genetic diseases. 
Second, although there is a conserved number 
of proteins and coding exons predicted for spe- 
cific large duplicated spans within the chromo- 
some 18 to 20 alignment, the genomic DNA of 
chromosome 18 in these specific spans is in 
some cases more than 10-fold longer than the 
corresponding chromosome 20 DNA. This selective
accretion of noncoding DNA (or conversely,
loss of noncoding DNA) on one of a
pair of duplicated chromosome regions was
observed in many compared regions. Hypothe- 
ses to explain which mechanisms foster these 
processes must be tested. 

Evaluation of the alignment results gives
some perspective on dating of the duplications. 
As noted above, large-scale ancient segmental 
duplication in fact best explains many of the 
blocks detected by this genome-wide analysis. 
The regions of human chromosomes involved 
in the large-scale duplications expanded upon 
above (chromosomes 2 to 14, 2 to 12, and 18 to 
20) are each syntenic to a distinct mouse chromosomal
region. The corresponding mouse
chromosomal regions are much more similar in 
sequence conservation, and even in order, to 
their human synteny partners than the human
duplication regions are to each other. Further, 
the corresponding mouse chromosomal regions 
each bear a significant proportion of genes or- 
thologous to the human genes on which the 
human duplication assignments were made. On 
the basis of these factors, the corresponding 
mouse chromosomal spans, at coarse resolu- 
tion, appear to be products of the same large- 
scale duplications observed in humans. Al- 
though further detailed analysis must be carried 
out once a more complete genome is assembled 
for mouse, the underlying large duplications 
appear to predate the two species' divergence. 
This dates the duplications, at the latest, before 
divergence of the primate and rodent lineages.
This date can be further refined upon examination
of the synteny between human chromosomes
and those of chicken, pufferfish (Fugu
rubripes), or zebrafish (95). The only sub- 
stantial syntenic stretches mapped in these 
species corresponding to both pairs of human 
duplications are restricted to the Hox cluster 
regions. When the synteny of these regions
(or others) to human chromosomes is extended
with further mapping, the ages of the
nearly chromosome-length duplications seen 
in humans are likely to be dated to the root of 
vertebrate divergence. 

The MUMmer-based results demonstrate 
large block duplications that range in size from 
a few genes to segments covering most of a 
chromosome. The extent of segmental duplica- 
tions raises the question of whether an ancient 
whole-genome duplication event is the under- 
lying explanation for the numerous duplicated 
regions (96). The duplications have undergone 
many deletions and subsequent rearrangements;
these events make it difficult to distinguish 
between a whole-genome duplication and mul- 
tiple smaller events. Further analysis, focused 
especially on comparing the estimated ages of 
all the block duplications, derived partially 
from interspecies genome comparisons, will be 
necessary to determine which of these two hypotheses
is more likely. Comparisons of genomes
of different vertebrates, and even cross-phyla
genome comparisons, will allow for the
deconvolution of duplications to eventually reveal
the stagewise history of our genome, and
with it a history of the emergence of many of 
the key functions that distinguish us from other 
living things. 

6 A Genome-Wide Examination of 
Sequence Variations 

Summary. Computational methods were used 
to identify single-nucleotide polymorphisms 
(SNPs) by comparison of the Celera sequence 
to other SNP resources. The SNP rate between
two chromosomes was ~1 per 1200 to
1500 bp. SNPs are distributed nonrandomly 
throughout the genome. Only a very small 
proportion of all SNPs (<1%) potentially 
impact protein function based on the func- 
tional analysis of SNPs that affect the pre- 
dicted coding regions. This results in an es- 
timate that only thousands, not millions, of 
genetic variations may contribute to the struc- 
tural diversity of human proteins. 

Having a complete genome sequence enables 
researchers to achieve a dramatic acceleration 
in the rate of gene discovery, but only through 
analysis of sequence variation in DNA can we 
discover the genetic basis for variation in health 
among human beings. Whole-genome shotgun 
sequencing is a particularly effective method 
for detecting sequence variation in tandem with 
whole-genome assembly. In addition, we compared
the distribution and attributes of SNPs
ascertained by three other methods: (i) alignment
of the Celera consensus sequence to the
PFP assembly, (ii) overlap of high-quality reads 
of genomic sequence (referred to as "Kwok"; 
1,120,195 SNPs) (97), and (iii) reduced repre- 
sentation shotgun sequencing (referred to as 
"TSC"; 632,640 SNPs) (98). These data were 
consistent in showing an overall nucleotide diversity
of ~8 × 10^-4, marked heterogeneity
across the genome in SNP density, and an 
overwhelming preponderance of noncoding 
variation that produces no change in expressed 
proteins. 

6.1 SNPs found by aligning the Celera 
consensus to the PFP assembly 

Ideally, methods of SNP discovery make full 
use of sequence depth and quality at every site, 
and quantitatively control the rate of false-pos- 
itive and false-negative calls with an explicit 
sampling model (99). Comparison of consensus 
sequences in the absence of these details neces- 
sitated a more ad hoc approach (quality scores 
could not readily be obtained for the PFP as- 
sembly). First, all sequence differences between 
the two consensus sequences were identified; 
these were then filtered to reduce the contribu- 
tion of sequencing errors and misassembly. As 
a measure of the effectiveness of the filtering 
step, we monitored the ratio of transition and 
transversion substitutions, because a 2:1 ratio 
has been well documented as typical in mammalian
evolution (100) and in human SNPs
(101, 102). The filtering steps consisted of removing
variants where the quality score in the
Celera consensus was less than 30 and where 
the density of variants was greater than 5 in 400 
bp. These filters resulted in shifting the transition-to-transversion
ratio from 1.57:1 to
1.89:1. When applied to 2.3 Gbp of alignments
between the Celera and PFP consensus sequences,
these filters resulted in identification
of 2,104,820 putative SNPs from a total of 
2,778,474 substitution differences. Overlaps 
between this set of SNPs and those found by 
other methods are described below. 
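
The two filters and the monitoring statistic described above can be sketched as follows. The tuple layout, function names, and the exact handling of the density rule are assumptions made for illustration: here a variant is dropped if it belongs to any group of more than five variants spanning less than 400 bp, and density is evaluated on the unfiltered variant list.

```python
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def is_transition(ref, alt):
    """A<->G and C<->T are transitions; every other substitution is a transversion."""
    return frozenset((ref, alt)) in TRANSITIONS


def filter_variants(variants, min_quality=30, max_per_window=5, window=400):
    """Keep variants with adequate quality that are not in a dense cluster.

    variants is a list of (position, ref, alt, quality) tuples sorted by position.
    """
    positions = [v[0] for v in variants]
    dense = set()
    k = max_per_window
    for i in range(len(positions) - k):
        # k + 1 variants inside one window means "greater than k in window bp".
        if positions[i + k] - positions[i] < window:
            dense.update(range(i, i + k + 1))
    return [v for idx, v in enumerate(variants)
            if v[3] >= min_quality and idx not in dense]


def ti_tv_ratio(variants):
    """Transition-to-transversion ratio of a variant list."""
    ti = sum(1 for _, ref, alt, _ in variants if is_transition(ref, alt))
    tv = len(variants) - ti
    return ti / tv if tv else float("inf")
```

Comparing `ti_tv_ratio` before and after `filter_variants` mirrors the way the ratio is used in the text as an indicator of filtering effectiveness.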

6.2 Comparisons to public SNP 
databases 

Additional SNPs, including 2,536,021 from 
dbSNP (www.ncbi.nlm.nih.gov/SNP) and 
13,150 from HGMD (Human Gene Muta- 
tion Database, from the University of 
Wales, UK), were mapped on the Celera con- 
sensus sequence by a sequence similarity 
search with the program PowerBlast (103). The 
two largest data sets in dbSNP are the Kwok 
and TSC sets, with 47% and 25% of the dbSNP 
records. Low-quality alignments with partial
coverage of the dbSNP sequence and alignments
that had less than 98% sequence identity
between the Celera sequence and the dbSNP
flanking sequence were eliminated. dbSNP sequences
mapping to multiple locations on the
Celera genome were discarded. A total of
2,336,935 dbSNP variants were mapped to
1,223,038 unique locations on the Celera sequence,
implying considerable redundancy in
dbSNP. SNPs in the TSC set mapped to
585,811 unique genomic locations, and SNPs in
the Kwok set mapped to 438,032 unique locations.
The combined unique SNP count used
in this analysis, including Celera-PFP, TSC,
and Kwok, is 2,737,668. Table 15 shows that a
substantial fraction of SNPs identified by one of
these methods was also found by another method.
The very high overlap (36.2%) between the
Kwok and Celera-PFP SNPs may be due in part
to the use by Kwok of sequences that went into
the PFP assembly. The unusually low overlap
(16.4%) between the Kwok and TSC sets is due

Table 15. Overlap of SNPs from genome-wide
SNP databases. Table entries are SNP counts for
each pair of data sets. Numbers in parentheses are
the fraction of overlap, calculated as the count of
overlapping SNPs divided by the number of SNPs
in the smaller of the two databases compared.
Total SNP counts for the databases are: Celera-PFP,
2,104,820; TSC, 585,811; and Kwok, 438,032.
Only unique SNPs in the TSC and Kwok data sets
were included.

              TSC        Kwok
Celera-PFP    188,694    158,532
              (0.322)    (0.362)
TSC                      72,024
                         (0.164)



to their being the smallest two sets. In addition, 
24.5% of the Celera-PFP SNPs overlap with 
SNPs derived from the Celera genome se- 
quences (46). SNP validation in population 
samples is an expensive and laborious process, 
so confirmation on multiple data sets may pro- 
vide an efficient initial validation "in silico" (by 
computational analysis). 
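
As a quick check of the overlap fractions in Table 15, the definition given in its caption (shared SNPs divided by the size of the smaller data set) can be written out directly. The function name is hypothetical; the counts are those quoted in the table and its caption.

```python
def overlap_fraction(shared, size_a, size_b):
    """Shared SNP count divided by the size of the smaller data set."""
    return shared / min(size_a, size_b)


CELERA_PFP, TSC, KWOK = 2_104_820, 585_811, 438_032

celera_tsc = overlap_fraction(188_694, CELERA_PFP, TSC)
celera_kwok = overlap_fraction(158_532, CELERA_PFP, KWOK)
tsc_kwok = overlap_fraction(72_024, TSC, KWOK)
```

Rounded to three decimals these reproduce the parenthesized values 0.322, 0.362, and 0.164.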

One means of assessing whether the 
three sets of SNPs provide the same picture 
of human variation is to tally the frequen- 
cies of the six possible base changes in 
each set of SNPs (Table 16). Previous measures
of nucleotide diversity were mostly
derived from small-scale analysis on candidate
genes (101), and our analysis with
all three data sets validates the previous 
observations at the whole-genome scale. 
There is remarkable homogeneity between 
the SNPs found in the Kwok set, the TSC 
set, and in our whole-genome shotgun (46) 
in this substitution pattern. Compared with 
the rest of the data sets, Celera-PFP devi- 
ates slightly from the 2:1 transition-to- 
transversion ratio observed in the other 
SNP sets. This result is not unexpected, 
because some fraction of the computation- 
ally identified SNPs in the Celera-PFP 
comparison may in fact be sequence errors. 
A 2:1 transition:transversion ratio for the 
bona fide SNPs would be obtained if one 
assumed that 15% of the sequence differ- 
ences in the Celera-PFP set were a result of 
(presumably random) sequence errors. 
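
The 15% figure can be reproduced approximately with a two-component mixture, sketched below. The assumptions are ours rather than stated in the text: true SNPs have a 2:1 transition-to-transversion ratio, and random sequence errors replace a base with each of the three alternatives equally often, which gives errors a 1:2 ratio (one transition and two transversions per base). The function name is hypothetical.

```python
def error_fraction(observed_ratio, true_ratio=2.0, error_ratio=0.5):
    """Fraction of differences that must be errors to explain the observed Ti:Tv.

    Each ratio r is converted to a transition fraction r / (1 + r); the observed
    fraction is then a weighted average of the true-SNP and error fractions.
    """
    def ti_fraction(r):
        return r / (1.0 + r)

    p_obs = ti_fraction(observed_ratio)
    p_true = ti_fraction(true_ratio)
    p_err = ti_fraction(error_ratio)
    return (p_true - p_obs) / (p_true - p_err)


# Celera-PFP ratio from Table 16.
estimated_errors = error_fraction(1.59)
```

With the 1.59:1 ratio from Table 16 this yields roughly 0.16, in line with the approximately 15% quoted in the text.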

6.3 Estimation of nucleotide diversity 
from ascertained SNPs 

The number of SNPs identified varied 
widely across chromosomes. In order to 
normalize these values to the chromosome 
size and sequence coverage, we used π, the
standard statistic for nucleotide diversity 
(104). Nucleotide diversity is a measure of 
per-site heterozygosity, quantifying the 
probability that a pair of chromosomes 
drawn from the population will differ at a 
nucleotide site. In order to calculate nucle- 
otide diversity for each chromosome, we 
need to know the number of nucleotide 
sites that were surveyed for variation, and 
in methods like reduced representation sequencing,
we need to know the sequence
quality and the depth of coverage at each
site. These data are not readily available, so
we could not estimate nucleotide diversity 
from the TSC effort. Estimation of nucleo- 
tide diversity from high-quality sequence 
overlaps should be possible, but again, 
more information is needed on the details 
of all the alignments. 
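
For readers unfamiliar with the statistic, nucleotide diversity can be sketched for the idealized case of fully sequenced, aligned chromosomes: it is the average number of differences per site over all pairs of sequences. The function below is a minimal illustration under that assumption; it ignores the coverage and quality corrections that the text says are needed for real shotgun data.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average pairwise differences per site among equal-length aligned sequences."""
    n_sites = len(sequences[0])
    pairs = list(combinations(sequences, 2))
    total_differences = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_differences / (len(pairs) * n_sites)


# Toy example: three 4-bp sequences, one of which differs at the last site.
toy_pi = nucleotide_diversity(["ACGT", "ACGA", "ACGT"])
```

In the toy example two of the three pairs differ at one of four sites, so the result is 2/12.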

Estimation of nucleotide diversity from a 
shotgun assembly entails calculating for each 
column of the multialignment, the probability 
that two or more distinct alleles are present, 
and the probability of detecting a SNP if in 
fact the alleles have different sequence (i.e., 
the probability of correct sequence calls). The 
greater the depth of coverage and the higher 
the sequence quality, the higher is the chance 
of successfully detecting a SNP (105). Even 
after correcting for variation in coverage, the 
nucleotide diversity appeared to vary across 
autosomes. The significance of this heteroge- 
neity was tested by analysis of variance, with 
estimates of π for 100-kbp windows to estimate
variability within chromosomes (for the
Celera-PFP comparison, F = 29.73, P < 
0.0001). 

Average diversity for the autosomes es- 
timated from the Celera-PFP comparison 
was 8.94 × 10^-4. Nucleotide diversity on
the X chromosome was 6.54 × 10^-4. The
X is expected to be less variable than au- 
tosomes, because for every four copies of 
autosomes in the population, there are only 
three X chromosomes, and this smaller ef- 
fective population size means that random 
drift will more rapidly remove variation 
from the X (106). 
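
The copy-number argument can be checked with simple arithmetic. Assuming, as a simplification not stated in the text, that neutral diversity scales in proportion to the number of chromosome copies and that mutation rates are equal on the X and the autosomes, the expected X-to-autosome diversity ratio is 3/4. The function names are hypothetical; the diversity values are those quoted above.

```python
def expected_x_to_autosome_ratio(x_copies=3, autosome_copies=4):
    """Expected diversity ratio if diversity is proportional to copy number."""
    return x_copies / autosome_copies


def observed_ratio(pi_x, pi_autosomes):
    """Observed ratio of X-chromosome to autosomal nucleotide diversity."""
    return pi_x / pi_autosomes


expected = expected_x_to_autosome_ratio()
observed = observed_ratio(6.54e-4, 8.94e-4)
```

The observed ratio is about 0.73, close to the expected 0.75.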

Having ascertained nucleotide variation 
genome-wide, it appears that previous esti- 
mates of nucleotide diversity in humans 
based on samples of genes were reasonably 
accurate (101, 102, 106, 107). Genome-wide,
our estimate of nucleotide diversity was
8.98 × 10^-4 for the Celera-PFP alignment,
and a published estimate averaged over 10
densely resequenced human genes was
8.00 × 10^-4 (108).

6.4 Variation in nucleotide diversity 
across the human genome 

Such an apparently high degree of variabil- 
ity among chromosomes in SNP density 
raises the question of whether there is het- 
erogeneity at a finer scale within chromo- 



Table 16. Summary of nucleotide changes in different SNP data sets.

SNP data set   A/G (%)   C/T (%)   A/C (%)   A/T (%)   C/G (%)   T/G (%)   Transition:transversion
Celera-PFP     30.7      30.7      10.3      8.6       9.2       10.3      1.59:1
Kwok*          33.7      33.8      8.5       7.0       8.6       8.4       2.07:1
TSC†           33.3      33.4      8.8       7.3       8.6       8.6       1.99:1

*November 2000 release of the NCBI database dbSNP (www.ncbi.nlm.nih.gov/SNP/) with the method defined as
OverlapSnpDetectionWithPolyBayes. The submitter of the data is Pui-Yan Kwok from Washington University.
†November 2000 release of NCBI dbSNP (www.ncbi.nlm.nih.gov/SNP/) with the methods defined as TSC-Sanger,
TSC-WICGR, and TSC-WUCSC. The submitter of the data is Lincoln Stein from Cold Spring Harbor Laboratory.






Fig. 13. Segmental duplica- 
tions between chromo- 
somes in the human ge- 
nome. The 24 panels show 
the 1077 duplicated blocks 
of genes, containing 10,310 
pairs of genes in total. Each
line represents a pair of ho- 
mologous genes belonging 
to a block; all blocks con- 
tain at least three genes 
on each of the chromo- 
somes where they appear. 
Each panel shows all the 
duplications between a 
single chromosome and 
other chromosomes with 
shared blocks. The chro- 
mosome at the center of 
each panel is shown as a 
thick red line for emphasis. 
Other chromosomes are 
displayed from top to bot- 
tom within each panel or- 
dered by chromosome 
number. The inset (bot- 
tom, center right) shows a 
close-up of one duplica- 
tion between chromo- 
somes 18 and 20, expand- 
ed to display the gene 
names of 12 of the 64 
gene pairs shown. 
somes, and whether this heterogeneity is
greater than expected by chance. If SNPs 
occur by random and independent mutations, 
then it would seem that there ought to be a 
Poisson distribution of numbers of SNPs in 
fragments of arbitrary constant size. The ob- 
served dispersion in the distribution of SNPs 
in 100-kbp fragments was far greater than 
predicted from a Poisson distribution (Fig. 
14). However, this simplistic model ignores 
the different recombination rates and popula- 
tion histories that exist in different regions of 
the genome. Population genetics theory holds 
that we can account for this variation with a 
mathematical formulation called the neutral 
coalescent (109). Applying well-tested algo- 
rithms for simulating the neutral coalescent 
with recombination (110), and using an effective
population size of 10,000 and a per-base
recombination rate equal to the mutation
rate (111), we generated a distribution of numbers
of SNPs by this model as well (112). The
observed distribution of SNPs has a much larg- 
er variance than either the Poisson model or the 
coalescent model, and the difference is highly 
significant. This implies that there is significant
variability across the genome in SNP density, 
an observation that begs an explanation. 
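
One simple way to see what "greater dispersion than Poisson" means is the index of dispersion, the variance of the window counts divided by their mean, which is close to 1 for Poisson-distributed counts. The sketch below uses this index as an illustration only; it is not the test used in the paper, and the window counts are invented for the example.

```python
from statistics import mean, pvariance


def dispersion_index(counts):
    """Variance-to-mean ratio of SNP counts per window (about 1 under a Poisson model)."""
    return pvariance(counts) / mean(counts)


# Hypothetical SNP counts in four 100-kbp windows (not data from the paper).
uneven_windows = [10, 150, 20, 140]
index = dispersion_index(uneven_windows)
```

Values far above 1, as in this made-up example, indicate that SNPs are clustered more than independent random mutation alone would predict.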

Several attributes of the DNA sequence 
may affect the local density of SNPs, in- 
cluding the rate at which DNA polymerase 
makes errors and the efficacy of mismatch 
repair. One key factor that is likely to be 
associated with SNP density is the G+C 
content, in part because methylated cy- 
tosines in CpG dinucleotides tend to under- 
go deamination to form thymine, account- 
ing for a nearly 10-fold increase in the 
mutation rate of CpGs over other dinucle- 
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otides. We tallied the GC content and nu- 
cleotide diversities in 100-kbp windows 
across the entire genome and found that the 
correlation between them was positive (r = 
0.21) and highly significant (P < 0.0001), 
but G+C content accounted for only a 
small part of the variation. 
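
A windowed correlation of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: the sequence and the per-window diversity values are made-up toy data, the window size is 10 bases rather than 100 kbp, and the function names (`gc_fraction`, `windows`, `pearson_r`) are hypothetical.

```python
import math


def gc_fraction(sequence):
    """Fraction of G and C bases in a sequence (case-insensitive)."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def windows(sequence, size):
    """Split a sequence into consecutive, non-overlapping full-length windows."""
    return [sequence[i:i + size]
            for i in range(0, len(sequence) - size + 1, size)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy input: a 40-base sequence in 10-base windows, with made-up diversity values.
toy_sequence = "ATATATATAT" "GCATATATAT" "GCGCGCATAT" "GCGCGCGCGC"
toy_diversity = [0.0006, 0.0007, 0.0009, 0.0008]

gc_values = [gc_fraction(w) for w in windows(toy_sequence, 10)]
r = pearson_r(gc_values, toy_diversity)
print("G+C per window:", gc_values)
print("Pearson r:", round(r, 2))
```

The square of r gives the share of variance explained, which is why a correlation of 0.21 leaves most of the variation in SNP density unaccounted for.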

6.5 SNPs by genomic class 

To test homogeneity of SNP densities 
across functional classes, we partitioned 
sites into intergenic (defined as >5 kbp
from any predicted transcription unit),
5'-UTR, exonic (missense and silent),
intronic, and 3'-UTR for 10,239 known
genes, derived from the NCBI RefSeq
database and all human genes predicted from
the Celera Otto annotation. In coding re- 
gions, SNPs were categorized as either si- 
lent, for those that do not change amino 
acid sequence, or missense, for those that 
change the protein product. The ratio of 
missense to silent coding SNPs in Celera- 
PFP, TSC, and Kwok sets (1.12, 0.91, and 
0.78, respectively) shows a markedly re- 
duced frequency of missense variants com- 
pared with the neutral expectation, consis- 
tent with the elimination by natural selec- 
tion of a fraction of the deleterious amino 
acid changes (112). These ratios are
comparable to the missense-to-silent ratios of
0.88 and 1.17 found by Cargill et al. (101)
and by Halushka et al. (102). Similar
results were observed in SNPs derived from
Celera shotgun sequences (46).

It is striking how small is the fraction of 
SNPs that lead to potentially dysfunctional 
alterations in proteins. In the 10,239 Ref- 
Seq genes, missense SNPs were only about 




[Fig. 14 plot. Axis label: Number of SNPs / 100 kbp]

Fig. 14. SNP density in each 100-kbp interval as determined with Celera-PFP SNPs. The color codes
are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution.
The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely
accounted for by a coalescent model of regional history.



0.12, 0.14, and 0.17% of the total SNP 
counts in Celera-PFP, TSC, and Kwok 
SNPs, respectively. Nonconservative pro- 
tein changes constitute an even smaller frac- 
tion of missense SNPs (47, 41, and 40% in 
Celera-PFP, Kwok, and TSC). Intergenic
regions have been virtually unstudied (113), and
we note that 75% of the SNPs we identified
were intergenic (Table 17). The SNP rate was
highest in introns and lowest in exons. The SNP
rate was lower in intergenic regions than in
introns, providing one of the first discriminators
between these two classes of DNA. These SNP
rates were confirmed in the Celera SNPs, which
also exhibited a lower rate in exons than in
introns, and in extragenic regions than in
introns (46). Many of these intergenic SNPs will
provide valuable information in the form of 
markers for linkage and association studies, and 
some fraction is likely to have a regulatory 
function as well. 

7 An Overview of the Predicted 
Protein-Coding Genes in the Human 
Genome 

Summary. This section provides an initial 
computational analysis of the predicted 
protein set with the aim of cataloging
prominent differences and similarities 
when the human genome is compared with 
other fully sequenced eukaryotic genomes. 
Over 40% of the predicted protein set in 
humans cannot be ascribed a molecular 
function by methods that assign proteins to 
known families. A protein domain-based
analysis provides a detailed catalog of the 
prominent differences in the human ge- 
nome when compared with the fly and 
worm genomes. Prominent among these are 
domain expansions in proteins involved in 
developmental regulation and in cellular 
processes such as neuronal function, hemo- 
stasis, acquired immune response, and cy- 
toskeletal complexity. The final enumera- 
tion of protein families and details of pro- 
tein structure will rely on additional exper- 
imental work and comprehensive manual 
curation. 

A preliminary analysis of the predicted human
protein-coding genes was conducted.
Two methods were used to analyze and clas- 
sify the molecular functions of 26,588 pre- 
dicted proteins that represent 26,383 gene 
predictions with at least two lines of evidence 
as described above. The first method was 
based on an analysis at the level of protein 
families, with both the publicly available 
Pfam database (114, 115) and Celera's Pan- 
ther Classification (CPC) (Fig. 15) (116). 
The second method was based on an analysis 
at the level of protein domains, with both the 
Pfam and SMART databases (115, 117).

The results presented here are prelimi- 
nary and are subject to several limitations. 
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Both the gene predictions and functional 
assignments have been made by using com- 
putational tools, although the statistical 
models in Panther, Pfam, and SMART have 
been built, annotated, and reviewed by ex- 
pert biologists. In the set of computationally 
predicted genes, we expect both false-positive 
predictions (some of these may in fact be inac- 
tive pseudogenes) and false-negative predic- 
tions (some human genes will not be computa- 
tionally predicted). We also expect errors in 
delimiting the boundaries of exons and genes. 
Similarly, in the automatic functional assign- 
ments, we also expect both false-positive and 
false-negative predictions. The functional as- 
signment protocol focuses on protein families 
that tend to be found across several organisms, 
or on families of known human genes. There- 
fore, we do not assign a function to many genes 
that are not in large families, even if the func- 
tion is known. Unless otherwise specified, all 
enumeration of the genes in any given family or 
functional category was taken from the set of 
26,588 predicted proteins, which were assigned 
functions by using statistical score cutoffs de- 
fined for models in Panther, Pfam, and 
SMART. 

For this initial examination of the pre- 
dicted human protein set, three broad ques- 
tions were asked: (i) What are the likely 
molecular functions of the predicted gene 
products, and how are these proteins cate- 
gorized with current classification meth- 
ods? (ii) What are the core functions that 
appear to be common across the animals? 



(iii) How does the human protein comple- 
ment differ from that of other sequenced 
eukaryotes? 

7.1 Molecular functions of predicted 
human proteins 

Figure 15 shows an overview of the putative
molecular functions of the predicted
26,588 human proteins that have at least 
two lines of supporting evidence. About 
41% (12,809) of the gene products could 
not be classified from this initial analysis 
and are termed proteins with unknown 
functions. Because our automatic classifi- 
cation methods treat only relatively large 
protein families, there are a number of 
"unclassified" sequences that do, in fact, 
have a known or predicted function. For the 
60% of the protein set that have automatic 
functional predictions, the specific protein 
functions have been placed into broad 
classes. We focus here on molecular function
(rather than higher order cellular processes)
in order to classify as many proteins
as possible. These functional predictions 
are based on similarity to sequences of 
known function. 

In our analysis of the 12,731 additional low- 
confidence predicted genes (those with only 
one piece of supporting evidence), only 636 
(5%) of these additional putative genes were 
assigned molecular functions by the automated 
methods. One-third of these 636 predicted 
genes represented endogenous retroviral pro- 
teins, further suggesting that the majority of 



these unknown-function genes are not real 
genes. Given that most of these additional 
12,095 genes appear to be unique among the 
genomes sequenced to date, many may simply 
represent false-positive gene predictions. 

The most common molecular functions are 
the transcription factors and those involved in 
nucleic acid metabolism (nucleic acid enzyme). 
Other functions that are highly represented in 
the human genome are the receptors, kinases, 
and hydrolases. Not surprisingly, most of the 
hydrolases are proteases. There are also many
proteins that are members of proto-oncogene 
families, as well as families of "select regula- 
tory molecules": (i) proteins involved in specif- 
ic steps of signal transduction such as hetero- 
trimeric GTP-binding proteins (G proteins) and 
cell cycle regulators, and (ii) proteins that mod- 
ulate the activity of kinases, G proteins, and 
phosphatases. 

Table 17. Distribution of SNPs in classes of genomic regions.

Genomic region class    Size of region examined (Mb)    Celera-PFP SNP density (SNP/Mb)
Intergenic              2185                            707
Gene (intron + exon)    646                             917
Intron                  615                             921
First intron            164                             808
Exon                    31                              529
First exon              10                              592



[Fig. 15 pie chart. Slice labels, given as category (number, percentage):]
cell adhesion (577, 1.9%)
miscellaneous (1318; percentage truncated in source)
viral protein (100, 0.3%)
transfer/carrier protein (203, 0.7%)
transcription factor (1850, 6.0%)
nucleic acid enzyme (2308, 7.5%)
signaling molecule (376, 1.2%)
receptor (1543, 5.0%)
kinase (868, 2.8%)
select regulatory molecule (988, 3.2%)
transferase (610, 2.0%)
synthase and synthetase (313, 1.0%)
oxidoreductase (656, 2.1%)
lyase (117, 0.4%)
ligase (56, 0.2%)
isomerase (163, 0.5%)
hydrolase (1227, 4.0%)
chaperone (159, 0.5%)
cytoskeletal structural protein (876, 2.8%)
extracellular matrix (437, 1.4%)
immunoglobulin (264, 0.9%)
ion channel (406, 1.3%)
motor (376, 1.2%)
structural protein of muscle (296, 1.0%)
protooncogene (902, 2.9%)
select calcium binding protein (34, 0.1%)
intracellular transporter (350, 1.1%)
transporter (533, 1.7%)
molecular function unknown (12809, 41.7%)
[Ring labels: GO categories; Panther categories]
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Fig. 15. Distribution 
of the molecular 
functions of 26,383 
human genes. Each 
slice lists the num- 
bers and percentages 
(in parentheses) of 
human gene functions 
assigned to a given 
category of molecular 
function. The outer cir- 
cle shows the assign- 
ment to molecular 
function categories in 
the Gene Ontology
(GO) (179), and the
inner circle shows
the assignment to
Celera's Panther molecular
function categories (116).
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7.2 Evolutionary conservation of core 
processes 

Because of the various "model organism" 
genome-sequencing projects that have al- 
ready been completed, reasonable compara- 
tive information is available for beginning the 
analysis of the evolution of the human ge- 
nome. The genomes of S. cerevisiae ("bak- 
ers' yeast") (118) and two diverse inverte- 
brates, C. elegans (a nematode worm) (119) 
and D. melanogaster (fly) (26), as well as the 
first plant genome, A. thaliana, recently com- 
pleted (92), provide a diverse background for 
genome comparisons. 

We enumerated the "strict orthologs" con- 
served between human and fly, and between 
human and worm (Fig. 16) to address the 
question, What are the core functions that 
appear to be common across the animals? 
The concept of orthology is important be- 
cause if two genes are orthologs, they can be 
traced by descent to the common ancestor of 
the two organisms (an "evolutionarily con- 
served protein set"), and therefore are likely 
to perform similar conserved functions in the 
different organisms. It is critical in this anal- 
ysis to separate orthologs (a gene that appears 
in two organisms by descent from a common 
ancestor) from paralogs (a gene that appears 
in more than one copy in a given organism by 
a duplication event) because paralogs may 
subsequently diverge in function. Following 
the yeast-worm ortholog comparison in 




(120), we identified two different cases for 
each pairwise comparison (human-fly and 
human-worm). The first case was a pair of
genes, one from each organism, for which 
there was no other close homolog in either 
organism. These are straightforwardly identi- 
fied as orthologous, because there are no 
additional members of the families that com- 
plicate separating orthologs from paralogs. 
The second case is a family of genes with 
more than one member in either or both of the 
organisms being compared. Chervitz et al.
(120) dealt with this case by analyzing a
phylogenetic tree that described the
relationships between all of the sequences in both
organisms and then looking for pairs of genes
that were nearest neighbors in the tree. If the
nearest-neighbor pairs were from different
organisms, those genes were presumed to be
orthologs. We note that these nearest neigh- 
bors can often be confidently identified from 
pairwise sequence comparison without hav- 
ing to examine a phylogenetic tree (see leg- 
end to Fig. 16). If the nearest neighbors are 
not from different organisms, there has been 
a paralogous expansion in one or both organ- 
isms after the speciation event (and/or a gene 
loss by one organism). When this one-to-one
correspondence is lost, defining an ortholog 
becomes ambiguous. For our initial compu- 
tational overview of the predicted human pro- 
tein set, we could not answer this question for 
every predicted protein. Therefore, we con- 



sider only "strict orthologs," i.e., the proteins 
with unambiguous one-to-one relationships 
(Fig. 16). By these criteria, there are 2758 
strict human-fly orthologs, 2031 human- 
worm (1523 in common between these sets). 
We define the evolutionarily conserved set as 
those 1523 human proteins that have strict 
orthologs in both D. melanogaster and C. 
elegans. 
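
The "strict ortholog" criteria given in the legend to Fig. 16 (reciprocal best hits, a P-value cutoff, and a cross-species score better than that of any paralog) can be sketched as below. This is an illustration under stated assumptions, not the pipeline used for the analysis. The hit tuples `(query, subject, score, p_value)`, the gene names, the use of a single numeric score as the measure of significance, and the function names are all hypothetical, and parsing of actual BLASTP output is omitted.

```python
def best_hits(hits):
    """Map each query to its highest-scoring (subject, score, p_value)."""
    best = {}
    for query, subject, score, p_value in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score, p_value)
    return best


def strict_orthologs(a_vs_b, b_vs_a, a_vs_a, b_vs_b, p_cutoff=1e-10):
    """Return reciprocal best-hit pairs that also beat every paralog score."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    # Best within-species hit for each gene, ignoring self-hits.
    para_a = best_hits([h for h in a_vs_a if h[0] != h[1]])
    para_b = best_hits([h for h in b_vs_b if h[0] != h[1]])
    no_paralog = (None, float("-inf"), None)

    pairs = []
    for gene_a, (gene_b, score, p_value) in best_ab.items():
        reciprocal = best_ba.get(gene_b)
        if reciprocal is None or reciprocal[0] != gene_a:
            continue  # not a bi-directional best hit
        if p_value >= p_cutoff:
            continue  # criterion (i): P-value cutoff
        pair_score = min(score, reciprocal[1])
        if pair_score <= para_a.get(gene_a, no_paralog)[1]:
            continue  # criterion (ii): a paralog in organism A scores as well
        if pair_score <= para_b.get(gene_b, no_paralog)[1]:
            continue  # criterion (ii): a paralog in organism B scores as well
        pairs.append((gene_a, gene_b))
    return sorted(pairs)


# Toy hits with hypothetical gene names.
human_vs_fly = [("hA", "fA", 200.0, 1e-50), ("hB", "fB", 150.0, 1e-35),
                ("hB2", "fB", 140.0, 1e-32), ("hC", "fC", 40.0, 1e-5)]
fly_vs_human = [("fA", "hA", 200.0, 1e-50), ("fB", "hB", 150.0, 1e-35),
                ("fC", "hC", 40.0, 1e-5)]
human_vs_human = [("hB", "hB2", 300.0, 1e-80), ("hB2", "hB", 300.0, 1e-80)]
fly_vs_fly = []

print(strict_orthologs(human_vs_fly, fly_vs_human, human_vs_human, fly_vs_fly))
```

In the toy data, the hB/fB pair is a reciprocal best hit but is rejected because hB has a closer human paralog, which is the situation the text describes as a paralogous expansion after speciation.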

The distribution of the functions of the 
conserved protein set is shown in Fig. 16. 
Comparison with Fig. 15 shows that, not 
surprisingly, the set of conserved proteins is 
not distributed among molecular functions in 
the same way as the whole human protein set. 
Compared with the whole human set (Fig. 
15), there are several categories that are
overrepresented in the conserved set by a factor of
~2 or more. The first category is nucleic acid
enzymes, primarily the transcriptional
machinery (notably DNA/RNA methyltransferases,
DNA/RNA polymerases, helicases,
DNA ligases, DNA- and RNA-processing
factors, nucleases, and ribosomal proteins).
The basic transcriptional and translational 
machinery is well known to have been con- 
served over evolution, from bacteria through 
to the most complex eukaryotes. Many ribo- 
nucleoproteins involved in RNA splicing also 
appear to be conserved among the animals. 
Other enzyme types are also overrepresented
(transferases, oxidoreductases, ligases,
lyases, and isomerases). Many of these en-



Fig. 16. Functions of putative orthologs across vertebrate
and invertebrate genomes. Each slice lists the number and
percentages (in parentheses) of "strict orthologs" between
the human, fly, and worm genomes involved in a given
category of molecular function. "Strict orthologs" are defined
here as bi-directional BLAST best hits (180) such that each
orthologous pair (i) has a BLASTP P-value of <10^-10
(120), and (ii) has a more significant BLASTP score than
any paralogs in either organism, i.e., there has likely been
no duplication subsequent to speciation that might make
the orthology ambiguous. This measure is quite strict and is a
lower bound on the number of orthologs. By these criteria,
there are 2758 strict human-fly orthologs, and 2031
human-worm orthologs (1523 in common between these sets).



[Fig. 16 pie chart. Slice labels, given as category (number, percentage):]
cytoskeletal structural protein (20, 1.2%)
chaperone (16, 0.9%)
cell adhesion (11, 0.6%)
miscellaneous (72, 4.3%)
viral protein (4, 0.2%)
transfer/carrier protein (11, 0.6%)
transcription factor (81, 4.7%)
nucleic acid enzyme (221, 12.9%)
extracellular matrix (12, 0.7%)
ion channel (7, 0.4%)
motor (13, 0.8%)
structural protein of muscle (8, 0.5%)
protooncogene (23, 1.3%)
intracellular transporter (51, 3.0%)
transporter (44, 2.6%)
receptor (23, 1.3%)
kinase (69, 4.0%)
select regulatory molecule (88, 5.1%)
transferase (70, 4.1%)
synthase and synthetase (second digit of count illegible, 3.7%)
oxidoreductase (64, 3.7%)
lyase (12, 0.7%)
ligase (values illegible in source)
molecular function unknown (613, 35.8%)
hydrolase (80, 4.7%)
isomerase (21, 1.2%)

zymes are involved in intermediary metabolism.
The only exception is the hydrolase
category, which is not significantly overrep- 
resented in the shared protein set. Proteases 
form the largest part of this category, and 
several large protease families have expanded 
in each of these three organisms after their 
divergence. The category of select regulatory 
molecules is also overrepresented in the con- 
served set. The major conserved families are 
small guanosine triphosphatases (GTPases) 
(especially the Ras-related superfamily, in- 
cluding ADP ribosylation factor) and cell 
cycle regulators (particularly the cullin fam- 
ily, cyclin C family, and several cell division 
protein kinases). The last two significantly 
overrepresented categories are protein trans- 
port and trafficking, and chaperones. The 
most conserved groups in these categories are 
proteins involved in coated vesicle-mediated 
transport, and chaperones involved in protein 
folding and heat-shock response [particularly 
the DNAJ family, and heat-shock protein 
60 (HSP60), HSP70, and HSP90 families]. 
These observations provide only a conserva- 
tive estimate of the protein families in the 
context of specific cellular processes that 
were likely derived from the last common 
ancestor of the human, fly, and worm. As 
stated before, this analysis does not provide a 
complete estimate of conservation across the 
three animal genomes, as paralogous
duplication makes the determination of true
orthologs difficult within the members of
conserved protein families.
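
One simple way to express the overrepresentation discussed above is the ratio of a category's share in the conserved set (Fig. 16) to its share in the whole predicted set (Fig. 15). The sketch below uses the slice percentages of five categories as printed in those two figures. The ratio itself and the names `whole_set_pct`, `conserved_set_pct`, and `fold_enrichment` are assumptions made for illustration and are not necessarily the measure used in the analysis.

```python
# Slice percentages as printed in Fig. 15 (whole predicted set).
whole_set_pct = {
    "transferase": 2.0,
    "oxidoreductase": 2.1,
    "intracellular transporter": 1.1,
    "select regulatory molecule": 3.2,
    "hydrolase": 4.0,
}

# Slice percentages as printed in Fig. 16 (strict-ortholog set).
conserved_set_pct = {
    "transferase": 4.1,
    "oxidoreductase": 3.7,
    "intracellular transporter": 3.0,
    "select regulatory molecule": 5.1,
    "hydrolase": 4.7,
}


def fold_enrichment(conserved, whole):
    """Ratio of conserved-set share to whole-set share for shared categories."""
    return {name: conserved[name] / whole[name]
            for name in whole if name in conserved}


fold = fold_enrichment(conserved_set_pct, whole_set_pct)
for category, value in sorted(fold.items(), key=lambda item: -item[1]):
    print(f"{category}: {value:.2f}")
```

On these five categories the hydrolases show the smallest ratio, in line with the statement that the hydrolase category is not significantly overrepresented in the shared protein set.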

7.3 Differences between the human 
genome and other sequenced 
eukaryotic genomes 

To explore the molecular building blocks of 
the vertebrate taxon, we have compared the 
human genome with the other sequenced 
eukaryotic genomes at three levels: molec- 
ular functions, protein families, and protein 
domains.

Molecular differences can be correlated 
with phenotypic differences to begin to reveal 
the developmental and cellular processes that 
are unique to the vertebrates. Tables 18 and 
19 display a comparison among all sequenced 
eukaryotic genomes, over selected protein/ 
domain families (defined by sequence simi- 
larity, e.g., the serine-threonine protein ki- 
nases) and superfamilies (defined by shared 
molecular function, which may include sev- 
eral sequence-related families, e.g., the cyto- 
kines). In these tables we have focused on 
(super) families that are either very large or 
that differ significantly in humans compared
with the other sequenced eukaryote genomes. 
We have found that the most prominent hu- 
man expansions are in proteins involved in (i) 
acquired immune functions; (ii) neural devel- 
opment, structure, and functions; (iii) inter- 
cellular and intracellular signaling pathways 



in development and homeostasis; (iv) hemo- 
stasis; and (v) apoptosis. 

Acquired immunity. One of the most 
striking differences between the human ge- 
nome and the Drosophila or C. elegans ge- 
nome is the appearance of genes involved in 
acquired immunity (Tables 18 and 19). This 
is expected, because the acquired immune 
response is a defense system that only occurs 
in vertebrates. We observe 22 class I and 22 
class II major histocompatibility complex 
(MHC) antigen genes and 114 other immu- 
noglobulin genes in the human genome. In 
addition, there are 59 genes in the cognate 
immunoglobulin receptor family. At the do- 
main level, this is exemplified by an expan- 
sion and recruitment of the ancient immuno- 
globulin fold to constitute molecules such as 
MHC, and of the integrin fold to form several 
of the cell adhesion molecules that mediate 
interactions between immune effector cells 
and the extracellular matrix. Vertebrate-spe- 
cific proteins include the paracrine immune 
regulators family of secreted 4-alpha helical 
bundle proteins, namely the cytokines and 
chemokines. Some of the cytoplasmic signal 
transduction components associated with cy- 
tokine receptor signal transduction are also 
features that are poorly represented in the fly 
and worm. These include protein domains 
found in the signal transducers and activators of
transcription (STATs), the suppressors of
cytokine signaling (SOCS), and protein
inhibitors of activated STATs (PIAS). In contrast,
many of the animal-specific protein domains 
that play a role in innate immune response, 
such as the Toll receptors, do not appear to be 
significantly expanded in the human genome. 

Neural development, structure, and 
function. In the human genome, as compared 
with the worm and fly genomes, there is a 
marked increase in the number of members 
of protein families that are involved in 
neural development. Examples include neu- 
rotrophic factors such as ependymin, nerve 
growth factor, and signaling molecules 
such as semaphorins, as well as the number 
of proteins involved directly in neural
structure and function such as myelin
proteins, voltage-gated ion channels, and
synaptic proteins such as synaptotagmin.
These observations correlate well with the 
known phenotypic differences between the 
nervous systems of these taxa, notably (i) 
the increase in the number and connectivity 
of neurons; (ii) the increase in number of 
distinct neural cell types (as many as a 
thousand or more in human compared with 
a few hundred in fly and worm) (121); (iii) 
the increased length of individual axons; 
and (iv) the significant increase in glial cell 
number, especially the appearance of my- 
elinating glial cells, which are electrically 
inert supporting cells differentiated from 
the same stem cells as neurons. A number 



of prominent protein expansions are in- 
volved in the processes of neural develop- 
ment. Of the extracellular domains that me- 
diate cell adhesion, the connexin domain- 
containing proteins (122) exist only in hu- 
mans. These proteins, which are not present 
in the Drosophila or C. elegans genomes, 
appear to provide the constitutive subunits 
of intercellular channels and the structural 
basis for electrical coupling. Pathway find- 
ing by axons and neuronal network forma- 
tion is mediated through a subset of ephrins 
and their cognate receptor tyrosine kinases 
that act as positional labels to establish 
topographical projections (123). The
probable biological role for the semaphorins (22
in human compared with 6 in the fly and 2
in the worm) and their receptors
(neuropilins and plexins) is that of axonal guidance
molecules (124). Signaling molecules such 
as neurotrophic factors and some cytokines 
have been shown to regulate neuronal cell 
survival, proliferation, and axon guidance 
(125). Notch receptors and ligands play 
important roles in glial cell fate determina- 
tion and gliogenesis (126). 

Other human expanded gene families play 
key roles directly in neural structure and 
function. One example is synaptotagmin (ex- 
panded more than twofold in humans relative 
to the invertebrates), originally found to reg- 
ulate synaptic transmission by serving as a 
Ca2+ sensor (or receptor) during synaptic
vesicle fusion and release (127). Of interest is 
the increased co-occurrence in humans of 
PDZ and the SH3 domains in neuronal- 
specific adaptor molecules; examples include 
proteins that likely modulate channel activity 
at synaptic junctions (128). We also noted 
expansions in several ion-channel families 
(Table 19), including the EAG subfamily 
(related to cyclic nucleotide gated channels), 
the voltage-gated calcium/sodium channel 
family, the inward-rectifier potassium
channel family, and the voltage-gated potassium
channel, alpha subunit family. Voltage-gated
sodium and potassium channels are involved 
in the generation of action potentials in neu- 
rons. Together with voltage-gated calcium 
channels, they also play a key role in cou- 
pling action potentials to neurotransmitter re- 
lease, in the development of neurites, and in 
short-term memory. The recent observation
of a calcium-regulated association between 
sodium channels and synaptotagmin may 
have consequences for the establishment and 
regulation of neuronal excitability (129).

Myelin basic protein and myelin-associat- 
ed glycoprotein are major classes of protein 
components in both the central and peripheral 
nervous system of vertebrates. Myelin P0 is a 
major component of peripheral myelin, and 
myelin proteolipid and myelin oligodendrocyte
glycoprotein are found in the central
nervous system. Mutations in any of these

Table 18. Domain-based comparative analysis of proteins in H. sapiens (H),
D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A. thaliana (A). The
predicted protein set of each of the above eukaryotic organisms was analyzed
with Pfam version 5.5 using E value cutoffs of 0.001. The number of proteins
containing the specified Pfam domains as well as the total number of domains
(in parentheses) are shown in each column. Domains were categorized into
cellular processes for presentation. Some domains (i.e., SH2) are listed in
more than one cellular process. Results of the Pfam analysis may differ from
results obtained based on human curation of protein families, owing to the
limitations of large-scale automatic classifications. Representative examples
of domains with reduced counts owing to the stringent E value cutoff used for
this analysis are marked with a double asterisk (**). Examples include short
divergent and predominantly alpha-helical domains, and certain classes of
cysteine-rich zinc finger proteins.



Accession 
number 


uruiTiain name 


Domain description 


H 


F 


w 


Y 


A 






Developmental and homeostatic regulators 










PF02039 


Adrenomedullin 


Adrenomedullin 


1 


0 


0 


0 


0 


PF00212 


ANP 


Atrial natriuretic peptide 


2 


0 


0 


0 


0 


PF00028 


Cadherin 


Cadherin domain 


100 (550) 


14(157) 


16(66) 


0 


0 


PF00214 


Calc.CGRP IAPP 


caicitonin/CGRP/IAPP family 


3 


0 


0 


0 


0 


PF01110 


CNTF 


Ciliary neurotrophic factor 


1 


0 


0 


0 


0 


PF01093 


Clusterin 


Clusterin 


3 


0 


0 


0 


0 


PF00029 


Connexin 


Connexin 


14(16) 


0 


0 


0 


0 


PF00976 


ACThLdomain 


Corticotropin ACTH domain 


1 


0 


0 


0 


0 


PF00473 


CRF 


Corticotropin-releasing factor family 


2 


1 


0 


0 


0 


PF00007 


Cysjcnot 


Cystine-knot domain 


10(11) 


2 


0 


0 


0 


PF00778 


DIX 


Dix domain 


5 


2 


4 


0 


0 


PF00322 


Endothelin 


Endothelin family 


3 


0 


0 


0 


0 


PF00812 


Ephrin . 


cpnrin 


7(8) 


2 


4 


0 ■ 


0 


PF01404 


EPh Ibd 


Ephrin receptor Ugand binding domain 


12 


2 


1 


0 


0 


PF00167 


FCF 


Fibroblast growth factor 


23 


1 


1 


0 


0 


PF01534 


Frizzled 


Frizzled/Smoothened family membrane region 


9 


7 


3 


0 


0 


PF00236 


Hormone6 


Glycoprotein hormones 


1 


0 


0 


0 


0 


PF01153 


Glypican 


Glypican 


14 


2 


1 


. 0 


0 


PF01271 


Granin 


Grainin (chromogranin or secretogranin) 


3 


0 


0 


0 


0 


PF02058 


Guanylin 


Guanylin precursor 


1 


0 


0 


0 


0 


PF00049 


Insulin 


Insulin/IGF/Relaxin family 


7 


4 


0 


0 


0 


PF00219 


IGFBP 


•iniuun-iiNc gruwui racior Dinoing proteins 


10 


0 


. 0 


0 


0 


PF02024 


Leptin 


Leptin 


1 


0 


0 


0 


0 


PF00193 


Xlink 


li inn. \rryaiuron uinoingj 


13(23) 


0 


1 


0 


0 


PF00243 


NGF 


incivc growxn lacxor Tarniiy 


3 


.0 


0 N 


0 


0 


PF02158 


Neuregulin 


Neuregulin family 


4 


0 


0 


0 


0 


PF00184 


HormoneS 


Neurohypophysial hormones 


1 


0 


0 


0 


0 


PF02070 


NMU 


Neuromedin U 


1 


0 


0 


0 


0 


PF00066 


Notch 


Notch (DSL) domain 


3f5) 


2(4) 


2(6) 


0 


0 


PF00865 


Osteopontin 


Osteopontin 


1 


0 


0 


0 


0 




Hormone3 


Pancreatic hormone peptides 


3 


0 


0 


0 


0 




Parathyroid 


Parathyroid hormone family 


2 


0 


0 


0 


0 


rruu ic.3 


Hormone2 


Peptide hormone 


5(9) 


0 


0 


0 


0 




PDGr 


Platelet-derived growth factor (PDGF) 


5 


1 


0 


0 


0 


rrU lH\Jj 


Sema 


Sema domain 


27(29) 


8(10) 


3(4) 


0 


0 




Somatomedin_B 


Somatomedin B domain 


5(8) 


3 


0 


0 


0 


rruu iuj 


Hormone 


Somatotropin 


1 


0 


0 


0 


0 . 


PF0220A 


^nrh 


Sorbin homologous domain 


2 


0 


0 


0 


0 


rrut*rvj*t 


ji.r 


Stem cell factor 


2 


0 


0 


0 


0 


PF01034 




Syndecan domain 


3 . 


1 


1 


0 


0 


ppnno?n 

rrUWty 


1 N r K_Co 


TNFR/NGFR cysteine-rich region 


17(31) 


1 


0 


0 


0 




Tf~,F_ft 


Transforming growth factor p-like domain 


27 (28) 


6 


4 


0 


0 


PF01099 


t IfArnalrtKln 
uiciuglUtJin 


Uteroglobin family 


3 


0 


0 


0 


0 


PF01160 


Ooiods npiirnnpn 


Vertebrate endogenous opioids neuropeptide 


3 


0 


0 


0 


0 


PF00110 


Wnt 


Wnt family of developmental signaling proteins 


18 


7(10) 


5 


0 


0 






Hemostasis 












rrU 


ANATO 


Anaphylotoxin-Uke domain 


6(14) 


0 


0 


0 


0 


PF00386 


C1a 


C1q domain 


24 


0 


0 


0 


0 


PF00200 


Disintegrin 


Disintegrin 


18 


2 


3 


0 


0 


PF00754 


F5_F8_type_C 


F5/8 type C domain 


15(20) 


5(6) 


2 


0 


0 


PF01410 


COLFl 


Fibrillar collagen C- terminal domain 


10 


0 


0 


0 


0 


PF00039 


Fn1 


Ftbronectin type I domain 


5(18) 


0 


0 


0 


0 


PF00040 


Fn2 


Fibronectin type It domain 


11(16) 


0 


0 


0 


0 


PF00051 


Kringle 


Kringle domain 


15(24) 


2 


2 . 


0 


0 


PF01823 


MACPF 


MAC/Perforin domain 


6 


0 


0 


0 


0 


PF00354 


Pentaxin 


Pentaxin family 


9 


0 


0 


0 


0 


PF00277 


SAA_proteins 


Serum amyloid A protein 


4 


0 


0 


0 


0 


PF00084 


Sushi 


Sushi domain (SCR repeat) 


53 (191) 


11(42) 


8(45) 


0 


0 


PF02210 


TSPN 


Thrombospondin N-terminal-like domains 


14 


1 


0 


0 


0 


PF01108 


Tissuejac 


Tissue factor 


1 


0 


0 


0 


0 


PF00868 


Transglutamin.N 


Transglutaminase family 


6 


1 


0 


0 


0 


PF00927 


Transglutamin.C 


Transglutaminase family 


8 


1 


0 


0 


0 
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Accession 
number 



Domain name 



Domain description 



W 



PF00594 Gla Vitamin K-dependent carboxylation/gamma-carboxyglutamic (GLA) domain 11 0

Immune response

PF00711 Defensin_beta Beta defensin 1 0
PF00748 Calpain_inhib Calpain inhibitor repeat 3 (9) 0
PF00666 Cathelicidins Cathelicidins 2 0
PF00129 MHC_I Class I histocompatibility antigen, domains alpha 1 and 2 18 (20) 0
PF00993 MHC_II_alpha** Class II histocompatibility antigen, alpha domain 5 (6) 0
PF00969 MHC_II_beta** Class II histocompatibility antigen, beta domain 7 0
PF00879 Defensin_propep Defensin propeptide 3 0
PF01109 GM_CSF Granulocyte-macrophage colony-stimulating factor 1 0
PF00047 Ig Immunoglobulin domain 381 (930) 125 (291)
PF00143 Interferon Interferon alpha/beta domain 7 (9) 0
PF00714 IFN-gamma Interferon gamma 1 0
PF00726 IL10 Interleukin-10 1 0
PF02372 IL15 Interleukin-15 1 0
PF00715 IL2 Interleukin-2 1 0
PF00727 IL4 Interleukin-4 1 0
PF02025 IL5 Interleukin-5 1 0
PF01415 IL7 Interleukin-7/9 family 1 0
PF00340 IL1 Interleukin-1 7 0
PF02394 IL1_propep Interleukin-1 propeptide 1 0
PF02059 IL3 Interleukin-3 1 0
PF00489 IL6 Interleukin-6/G-CSF/MGF family 2 0
PF01291 LIF_OSM Leukemia inhibitory factor (LIF)/oncostatin (OSM) family 2 0
PF00323 Defensins Mammalian defensin 2 0
PF01091 PTN_MK PTN/MK heparin-binding protein 2 0
PF00277 SAA_proteins Serum amyloid A protein 4 0
PF00048 IL8 Small cytokines (intercrine/chemokine), interleukin-8 like 32 0
PF01582 TIR TIR domain 18 8
PF00229 TNF TNF (tumor necrosis factor) family 12 0
PF00088 Trefoil Trefoil (P-type) domain 5 (6) 0

PI-PY-rho GTPase signaling

PF00779 BTK BTK motif 5 1
PF00168 C2 C2 domain 73 (101) 32 (44)
PF00609 DAGKa Diacylglycerol kinase accessory domain (presumed) 9 4
PF00781 DAGKc Diacylglycerol kinase catalytic domain (presumed) 10 8
PF00610 DEP Domain found in Dishevelled, Egl-10, and Pleckstrin (DEP) 12 (13) 4
PF01363 FYVE FYVE zinc finger 28 (30) 14
PF00996 GDI GDP dissociation inhibitor 6 2
PF00503 G-alpha G-protein alpha subunit 27 (30) 10
PF00631 G-gamma G-protein gamma like domains 16 5
PF00616 RasGAP GTPase-activator protein for Ras-like GTPase 11 5
PF00618 RasGEFN Guanine nucleotide exchange factor for Ras-like GTPases; N-terminal motif 9 2
PF00625 Guanylate_kin Guanylate kinase 12 8
PF02189 ITAM Immunoreceptor tyrosine-based activation motif 3 0
PF00169 PH PH domain 193 (212) 72 (78)
PF00130 DAG_PE-bind Phorbol esters/diacylglycerol binding domain (C1 domain) 45 (56) 25 (31)
PF00388 PI-PLC-X Phosphatidylinositol-specific phospholipase C, X domain 12 3
PF00387 PI-PLC-Y Phosphatidylinositol-specific phospholipase C, Y domain 11 2
PF00640 PID Phosphotyrosine interaction domain (PTB/PID) 24 (27) 13
PF02192 PI3K_p85B PI3-kinase family, p85-binding domain 2 1
PF00794 PI3K_rbd PI3-kinase family, ras-binding domain 6 3
PF01412 ArfGAP Putative GTP-ase activating protein for Arf 16 9
PF02196 RBD Raf-like Ras-binding domain 6 (7) 4
PF02145 Rap_GAP Rap/ran-GAP 5 4
PF00788 RA Ras association (RalGDS/AF-6) domain 18 (19) 7 (9)
PF00071 Ras Ras family 126 56 (57)
PF00617 RasGEF RasGEF domain 21 8
PF00615 RGS Regulator of G protein signaling domain 27 6 (7)
PF02197 RIIa Regulatory subunit of type II PKA R-subunit 4 1



0 
0 
0 
0 

0 
0 
0 
0 

67 (323) 
0 
0 
0 
O 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 
0 
0 
0 

2 
0 
2 

0 

24(35) 
7 
8 
10 

15 
1 

20(23) 
5 
8 
3 

7 
0 

65 (68) 
26(40) 



11(12) 
1 
1 
8 
1 
2 
6 
51 

12(13) 
2 



0 
0 
0 
0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0. 
0 
. 0 
0 
0 
0 

0 
0 
0 
0 

0 
0 
0 

0 

6(9) 
0 
2 
5 

5 
1 
2 
1 
3 
5 

1 
0 
24 
1(2) 

1 



0 
0 
0 
6 
0 
0 
1 
23 
5 
1 
1 



0 
.0 
0 
0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 
0 
0 
0 

131 (143) 
0 
0 

0 

66(90) 
6 

' 11(12) 
2 

15 
3 
5 
0 
0 
0 

4 
0 
23 
4 



0 
0 
0 

15 
0 
0 
0 

78 
0 
0 
0 
The Human Genome

Accession number  Domain name  Domain description  H  F  W  Y  A

PF00620 RhoGAP RhoGAP domain 59 19 20 9 8
PF00621 RhoGEF RhoGEF domain 46 23 (24) 18 (19) 3 0
PF00536 SAM SAM domain (Sterile alpha motif) 29 (31) 15 8 3 6
PF01369 Sec7 Sec7 domain 13 5 5 5 9
PF00017 SH2 Src homology 2 (SH2) domain 87 (95) 33 (39) 44 (48) 1 3
PF00018 SH3 Src homology 3 (SH3) domain 143 (182) 55 (75) 46 (61) 23 (27) 4
PF01017 STAT STAT protein 7 1 1 (2) 0 0
PF00790 VHS VHS domain 4 2 4 4 8
PF00568 WH1 WH1 domain 7 2 2 (3) 1 0

Domains involved in apoptosis


PF00452 Bcl-2 Bcl-2 9 2 1 0 0
PF02180 BH4 Bcl-2 homology region 4 3 0 1 0 0
PF00619 CARD Caspase recruitment domain 16 0 2 0 0
PF00531 Death Death domain 16 5 7 0 0
PF01335 DED Death effector domain 4 (5) 0 0 0 0
PF02179 BAG Domain present in Hsp70 regulators 5 (8) 3 2 1 5
PF00656 ICE_p20 ICE-like protease (caspase) p20 domain 11 7 3 0 0
PF00653 BIR Inhibitor of Apoptosis domain 8 (14) 5 (9) 2 (3) 1 (2) 0

Cytoskeletal


PF00022 Actin Actin 61 (64) 15 (16) 12 9 (11) 24
PF00191 Annexin Annexin 16 (55) 4 (16) 4 (11) 0 6 (16)
PF00402 Calponin Calponin family 13 (22) 3 7 (19) 0 0
PF00373 Band_41 FERM domain (Band 4.1 family) 29 (30) 17 (19) 11 (14) 0 0
PF00880 Nebulin_repeat Nebulin repeat 4 (148) 1 (2) 1 0 0
PF00681 Plectin_repeat Plectin repeat 2 (11) 0 0 0 0
PF00435 Spectrin Spectrin repeat 31 (195) 13 (171) 10 (93) 0 0
PF00418 Tubulin-binding Tau and MAP proteins, tubulin-binding 4 (12) 1 (4) 2 (8) 0 0
PF00992 Troponin Troponin 4 6 8 0 0
PF02209 VHP Villin headpiece domain 5 2 2 0 5
PF01044 Vinculin Vinculin family 4 2 1 0 0

ECM adhesion


PF01391 Collagen Collagen triple helix repeat (20 copies) 65 (279) 10 (46) 174 (384) 0 0
PF01413 C4 C-terminal tandem repeated domain in type 4 procollagen 6 (11) 2 (4) 3 (6) 0 0
PF00431 CUB CUB domain 47 (69) 9 (47) 43 (67) 0 0
PF00008 EGF EGF-like domain 108 (420) 45 (186) 54 (157) 0 1
PF00147 Fibrinogen_C Fibrinogen beta and gamma chains, C-terminal globular domain 26 10 (11) 6 0 0
PF00041 Fn3 Fibronectin type III domain 106 (545) 42 (168) 34 (156) 0 1
PF00757 Furin-like Furin-like cysteine rich region 5 2 1 0 0
PF00357 Integrin_A Integrin alpha cytoplasmic region 3 1 2 0 0
PF00362 Integrin_B Integrins, beta chain 8 2 2 0 0
PF00052 Laminin_B Laminin B (Domain IV) 8 (12) 4 (7) 6 (10) 0 0
PF00053 Laminin_EGF Laminin EGF-like (Domains III and V) 24 (126) 9 (62) 11 (65) 0 0
PF00054 Laminin_G Laminin G domain 30 (57) 18 (42) 14 (26) 0 0
PF00055 Laminin_Nterm Laminin N-terminal (Domain VI) 10 6 4 0 0
PF00059 Lectin_c Lectin C-type domain 47 (76) 23 (24) 91 (132) 0 0
PF01463 LRRCT Leucine rich repeat C-terminal domain 69 (81) 23 (30) 7 (9) 0 0
PF01462 LRRNT Leucine rich repeat N-terminal domain 40 (44) 7 (13) 3 (6) 0 0


PF00f)^7 

rrUUvJ / 


Lui_recepL_a - 


Low-density lipoprotein receptor domain class A 


35 (127) 


33(152) 


27 (113) 


ft 

u 


ft, 


PF00058 


1 ffl rpf*e»nt" K 

LUl 1 CUCUl_U 


Low~ucn5iiy lipoprotein receptor repeat ciass d 


lb (95) 


3 (bbj 


/ \CC) 


A 

V 


n 

u 


PF00530 




Scavenger receptor cysteine-rich domain 


n (46) 


4(SJ 




f\ 

u 


A 

u 


PF00084 Sushi Sushi domain (SCR repeat) 53 (191) 11 (42) 8 (45) 0 0


PF00090 


Tsp 1 


i nroiTioosponuin lyue i oomain 


AA 


n hi\ 


1 B f 
lb(^/) 


o 


n 


PF00092 


Vwa 


von Willebrand factor type A domain 


34 (58) 


U 


1/(19) 


n, 


i 
i 




VWC 


von Willebrand factor type C domain 


19 (28) 


6(11) 


2(5) 


A 

u 


A 

u 


PF00094 


Vwd 


von wiueuianu idcior type w uornain 

Protein interaction domains 


lb [3b) 






A 

V 


. n 


PF00244 14-3-3 14-3-3 proteins 20 3 3 2 15




ARK 


Ank repeat 


145 (404) 


72 (269) 


75 (223) 


12 (Z0) 


DO (111) 


PF00514 Armadillo_seg Armadillo/beta-catenin-like repeats 22 (56) 11 (38) 3 (11) 2 (10) 25 (67)
PF00168 C2 C2 domain 73 (101) 32 (44) 24 (35) 6 (9) 66 (90)
PF00027 cNMP_binding Cyclic nucleotide-binding domain 26 (31) 21 (33) 15 (20) 2 (3) 22
PF01556 DnaJ_C DnaJ C terminal region 12 9 5 3 19
PF00226 DnaJ DnaJ domain 44 34 33 20 93
PF00036 Efhand** EFhand 83 (151) 64 (117) 41 (86) 4 (11) 120 (328)
PF00611 FCH Fes/CIP4 homology domain 9 3 2 4 0
PF01846 FF FF domain 4 (11) 4 (10) 3 (16) 2 (5) 4 (8)
PF00498 FHA FHA domain 13 15 7 13 (14) 17
myelin proteins result in severe demyelination, which is a pathological condition in which the myelin is lost and the nerve conduction is severely impaired (130). Humans have at least 10 genes belonging to four different families involved in myelin production (five myelin P0, three myelin proteolipid, myelin basic protein, and myelin-oligodendrocyte glycoprotein, or MOG), and possibly more-remotely related members of the MOG family. Flies have only a single myelin proteolipid, and worms have none at all.

Intercellular and intracellular signaling pathways in development and homeostasis. Many protein families that have expanded in humans relative to the invertebrates are involved in signaling processes, particularly in response to development and differentiation

Table 18 (Continued)



Accession number  Domain name  Domain description

PF00254 FKBP FKBP-type peptidyl-prolyl cis-trans isomerases
PF01590 GAF GAF domain
PF01344 Kelch Kelch motif
PF00560 LRR** Leucine Rich Repeat
PF00917 MATH MATH domain
PF00989 PAS PAS domain
PF00595 PDZ PDZ domain (Also known as DHR or GLGF)
PF00169 PH PH domain
PF01535 PPR** PPR repeat
PF00536 SAM SAM domain (Sterile alpha motif)
PF01369 Sec7 Sec7 domain
PF00017 SH2 Src homology 2 (SH2) domain
PF00018 SH3 Src homology 3 (SH3) domain
PF01740 STAS STAS domain
PF00515 TPR** TPR domain
PF00400 WD40** WD40 domain
PF00397 WW WW domain
PF00569 ZZ ZZ-Zinc finger present in dystrophin, CBP/p300

Nuclear interaction domains

PF01754 Zf-A20 A20-like zinc finger
PF01388 ARID ARID DNA binding domain
PF01426 BAH BAH domain
PF00643 Zf-B_box** B-box zinc finger
PF00533 BRCT BRCA1 C Terminus (BRCT) domain
PF00439 Bromodomain Bromodomain
PF00651 BTB BTB/POZ domain
PF00145 DNA_methylase Cytosine-specific DNA methylase
PF00385 Chromo 'chromo' (CHRromatin Organization MOdifier) domain
PF00125 Histone Core histone H2A/H2B/H3/H4
PF00134 Cyclin Cyclin
PF00270 DEAD DEAD/DEAH box helicase
PF01529 Zf-DHHC DHHC zinc finger domain
PF00646 F-box** F-box domain
PF00250 Fork_head Fork head domain
PF00320 GATA GATA zinc finger
PF01585 G-patch G-patch domain
PF00010 HLH** Helix-loop-helix DNA-binding domain
PF00850 Hist_deacetyl Histone deacetylase family
PF00046 Homeobox Homeobox domain
PF01833 TIG IPT/TIG domain
PF02373 JmjC JmjC domain
PF02375 JmjN JmjN domain
PF00013 KH-domain KH domain
PF01352 KRAB KRAB box
PF00104 Hormone_rec Ligand-binding domain of nuclear hormone receptor
PF00412 LIM LIM domain containing proteins
PF00917 MATH MATH domain
PF00249 Myb_DNA-binding Myb-like DNA-binding domain
PF02344 Myc-LZ Myc leucine zipper domain
PF01753 Zf-MYND MYND finger
PF00628 PHD PHD-finger
PF00157 Pou Pou domain - N-terminal to homeobox domain
PF02257 RFX_DNA_binding RFX DNA-binding domain
PF00076 Rrm RNA recognition motif (a.k.a. RRM, RBD, or RNP domain)
PF02037 SAP SAP domain
PF00622 SPRY SPRY domain
PF01852 START START domain
PF00907 T-box T-box



H 

15(20) 
. 7(8) 
54(157) 

25 (30) 
11 

18(19) 
96(154) 
193 (212) 
5 

29(31) 
13 
87 (95) 
143 (182) 
5 

72(131) 
136(305) 
32(53) 
10(11) 

2(8) 
11 
8(10) 
32(35) 
17(28) 
37(48) 
97 (98) 
3(4) 
24(27) 

75(81) 
19 

63 (66) 
15 
16 

35(36) 

11(17) 
18 

60(61) 
12 

160 (178) 
29(53) 
10 
7 

28 (67) 
204(243) 
47 

62 (129) 
11 

32(43) 
1 

14 
68 (86) 
15 
7 

224(324) 

15 
44(51) 

10 
17(19) 



W 



7(8) 


7(13) 


4 


24(29) 


2W 


1 


0 


10 


12(48) 


13(41) 


3 


102 (178) 


24(30) 


7(11) 


1 


15(16) 


5 


88 (161) 


1 


61 (74) 


9(10) 


6 


1 


13(18) 


60 (87) 


46(66) 


2 


5 


72(78) 


65(68) 


24 


23 


3(4) 


0 


1 


474(2485) 


15 


8 


3 


6 


5 


5 


5 


9 


33(39) 


44(48) 




3 


55(75) 


46(61) 


23(27) 


4 


1 


6 


2 


13 


39 (101) 


28(54) 


16(31) 


65 (124) 


98 (226) 


72(153) 


56(121) 


167 (344) 


24(39) 


16(24) 


5(8) 


11(15) 


13 


10 


2 


10 


2 


2 


0 


8 


6 


4 


2 




7(8) 


4(5) 


5 


.21(25) 


1 




0 


0 


10(18) 


23(35) 


10(16) 


12(16) 


16(22) 


18 (26) 


10(15) 


28 


62(64) 


86 (91) 


1(2) 


30 (31) 


1 


0 


0 


• 13(15) 


14(15) . 


17(18) 


1(2) 


12 



5 
10 
48(50) 
20 
15 

20(21) 
5(6) 
15 
44 
5(6) 
100 (103) 
1103) 
4 
4 

14(32) 
0 
17 

33(83) 
5 

18(24) 
0 
14 

40(53) 
5 
2 

127(199) 



71 (73) 
10 

55(57) 
16 

309(324) 
15 
8(10) 
13 
24 
8(10) 
82 (84) 
5(7) 
6 
2 

17(46) 
0 

142(147) 

33(79) 
88(161) 
17(24) 
0 
9 

32 (44) 
" 4 
1 

94(145) 



8 5 

10(12) 5(7) 

2 6 

8 22 



8 
11 

50(52) 
7 
9 
4 
9 
4 
4 
5 
6 
2 
4 
3 

4(14) 
0 
0 

4(7) 
1 

15(20) 
0 
1 

14(15) 
0 
1 

43 (73) 

5 
3 
0 
0 



48 
35 
84(87) 
22 

165(167) 
0 
26 
14(15) 
39 
10 
66 
' 1 
7 
7 

27(61) 
0 
0 

10(16)' 
61 (74) 
243 (401) 
0 
7 

96 (105) 
0 
0 

232 (369) 

6(7) 
6 
23 
0 
Accession number  Domain name  Domain description  H  F  W  Y  A

PF02135 Zf-TAZ TAZ finger 2 (3) 1 (2) 6 (7) 0 10 (15)


PF01285 


TEA 


TEA domain 


4 


1 




1 


o 


PF02176 


Zf-TRAF 


TRAF-type zinc finger 


6(9) 


1(3) 




0 


2 


PF00352 TBP Transcription factor TFIID (or TATA-binding protein, TBP) 2 (4) 4 (8) 2 (4) 1 (2) 2 (4)
PF00567 TUDOR TUDOR domain 9 (24) 9 (19) 4 (5) 0 2
PF00642 Zf-CCCH Zinc finger C-x8-C-x5-C-x3-H type (and similar) 17 (22) 6 (8) 22 (42) 3 (5) 31 (46)
PF00096 Zf-C2H2** Zinc finger, C2H2 type 564 (4500) 234 (771) 68 (155) 34 (56) 21 (24)
PF00097 Zf-C3HC4 Zinc finger, C3HC4 type (RING finger) 135 (137) 57 88 (89) 18 298 (304)
PF00098 Zf-CCHC Zinc knuckle 9 (17) 6 (10) 17 (33) 7 (13) 68 (91)



(Tables 18 and 19). They include secreted hormones and growth factors, receptors, intracellular signaling molecules, and transcription factors.

Developmental signaling molecules that are enriched in the human genome include growth factors such as wnt, transforming growth factor-β (TGF-β), fibroblast growth factor (FGF), nerve growth factor, platelet derived growth factor (PDGF), and ephrins. These growth factors affect tissue differentiation and a wide range of cellular processes involving actin-cytoskeletal and nuclear regulation. The corresponding receptors of these developmental ligands are also expanded in humans. For example, our analysis suggests at least 8 human ephrin genes (2 in the fly, 4 in the worm) and 12 ephrin receptors (2 in the fly, 1 in the worm). In the wnt signaling pathway, we find 18 wnt family genes (6 in the fly, 5 in the worm) and 12 frizzled receptors (6 in the fly, 5 in the worm). The Groucho family of transcriptional corepressors downstream in the wnt pathway is even more markedly expanded, with 13 predicted members in humans (2 in the fly, 1 in the worm).
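
The fold expansions implied by the counts quoted above can be tabulated with a short sketch. The dictionary simply transcribes the human, fly, and worm gene counts from this paragraph; the names `FAMILY_COUNTS`, `fold_expansion`, and `expansion_table` are illustrative assumptions and are not part of the original analysis.

```python
# Gene counts quoted in the text, stored as (human, fly, worm).
FAMILY_COUNTS = {
    "ephrin": (8, 2, 4),
    "ephrin receptor": (12, 2, 1),
    "wnt": (18, 6, 5),
    "frizzled receptor": (12, 6, 5),
    "Groucho": (13, 2, 1),
}


def fold_expansion(human: int, other: int) -> float:
    """Return the ratio of the human count to another organism's count."""
    if other <= 0:
        raise ValueError("fold expansion is undefined for a zero count")
    return human / other


def expansion_table(counts):
    """Map each family to its (human/fly, human/worm) fold expansions."""
    return {
        name: (fold_expansion(h, f), fold_expansion(h, w))
        for name, (h, f, w) in counts.items()
    }


if __name__ == "__main__":
    for name, (vs_fly, vs_worm) in expansion_table(FAMILY_COUNTS).items():
        print(f"{name}: {vs_fly:.1f}x fly, {vs_worm:.1f}x worm")
```

On these numbers the Groucho family shows the largest ratios of the five families listed, in line with the statement that it is "even more markedly expanded".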

Extracellular adhesion molecules involved in signaling are expanded in the human genome (Tables 18 and 19). The interactions of several of these adhesion domains with extracellular matrix proteoglycans play a critical role in host defense, morphogenesis, and tissue repair (131). Consistent with the well-defined role of heparan sulfate proteoglycans in modulating these interactions (132), we observe an expansion of the heparin sulfate sulfotransferases in the human genome relative to worm and fly. These sulfotransferases modulate tissue differentiation (133). A similar expansion in humans is noted in structural proteins that constitute the actin-cytoskeletal architecture. Compared with the fly and worm, we observe an explosive expansion of the nebulin (35 domains per protein on average), aggrecan (12 domains per protein on average), and plectin (5 domains per protein on average) repeats in humans. These repeats are present in proteins involved in modulating the actin-cytoskeleton with predominant expression in neuronal, muscle, and vascular tissues.



Comparison across the five sequenced eukaryotic organisms revealed several expanded protein families and domains involved in cytoplasmic signal transduction (Table 18). In particular, signal transduction pathways playing roles in developmental regulation and acquired immunity were substantially enriched. There is a factor of 2 or greater expansion in humans in the Ras superfamily GTPases and the GTPase activator and GTP exchange factors associated with them. Although there are about the same number of tyrosine kinases in the human and C. elegans genomes, in humans there is an increase in the SH2, PTB, and ITAM domains involved in phosphotyrosine signal transduction. Further, there is a twofold expansion of phosphodiesterases in the human genome compared with either the worm or fly genomes.
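
As a rough cross-check, the "factor of 2 or greater" statement can be tested against the protein counts listed in Table 19. The sketch below transcribes two Table 19 rows (Ras superfamily and Y protein kinase) as human, fly, and worm counts; the function name `is_expanded` and the default threshold of 2.0 are assumptions chosen to mirror the wording of the paragraph, not a method described in the text.

```python
# Protein counts transcribed from Table 19, stored as (human, fly, worm).
TABLE_19_COUNTS = {
    "Ras superfamily": (141, 64, 62),
    "Y protein kinase": (106, 47, 100),
}


def is_expanded(human: int, other: int, threshold: float = 2.0) -> bool:
    """True when the human count is at least `threshold` times the other count."""
    if other <= 0:
        # A ratio cannot be formed against a zero count; treat as not comparable.
        return False
    return human / other >= threshold


if __name__ == "__main__":
    for family, (human, fly, worm) in TABLE_19_COUNTS.items():
        print(
            f"{family}: expanded vs fly = {is_expanded(human, fly)}, "
            f"expanded vs worm = {is_expanded(human, worm)}"
        )
```

With these counts the Ras superfamily passes the twofold threshold against both invertebrates, while the tyrosine kinase count does not pass it against the worm, which matches the "about the same number" remark.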

The downstream effectors of the intracellular signaling molecules include the transcription factors that transduce developmental fates. Significant expansions are noted in the ligand-binding nuclear hormone receptor class of transcription factors compared with the fly genome, although not to the extent observed in the worm (Tables 18 and 19). Perhaps the most striking expansion in humans is in the C2H2 zinc finger transcription factors. Pfam detects a total of 4500 C2H2 zinc finger domains in 564 human proteins, compared with 771 in 234 fly proteins. This means that there has been a dramatic expansion not only in the number of C2H2 transcription factors, but also in the number of these DNA-binding motifs per transcription factor (8 on average in humans, 3.3 on average in the fly, and 2.3 on average in the worm). Furthermore, many of these transcription factors contain either the KRAB or SCAN domains, which are not found in the fly or worm genomes. These domains are involved in the oligomerization of transcription factors and increase the combinatorial partnering of these factors. In general, most of the transcription factor domains are shared between the three animal genomes, but the reassortment of these domains results in organism-specific transcription factor families. The domain combinations found in the human, fly, and worm include the BTB with C2H2 in the fly and humans, and homeodomains alone or in combination with Pou and LIM domains in all of the animal genomes. In plants, however, a different set of transcription factors are expanded, namely, the myb family, and a unique set that includes VP1 and AP2 domain-containing proteins (134). The yeast genome has a paucity of transcription factors compared with the multicellular eukaryotes, and its repertoire is limited to the expansion of the yeast-specific C6 transcription factor family involved in metabolic regulation.
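
The per-protein averages quoted above (8, 3.3, and 2.3) can be reproduced from the Table 18 cells for the C2H2 zinc finger row. The sketch below assumes that a cell written as "564 (4500)" means 564 proteins carrying 4500 domains in total, which is how the text itself reads the human entry; the helper names `parse_cell` and `domains_per_protein` are hypothetical.

```python
import re
from typing import Optional, Tuple

# A cell is either "N" or "N (M)"; spacing before the parenthesis varies.
_CELL = re.compile(r"^\s*(\d+)\s*(?:\(\s*(\d+)\s*\))?\s*$")


def parse_cell(cell: str) -> Tuple[int, Optional[int]]:
    """Split a table cell into (protein count, domain count or None)."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    proteins = int(match.group(1))
    domains = int(match.group(2)) if match.group(2) is not None else None
    return proteins, domains


def domains_per_protein(cell: str) -> float:
    """Average number of domains per protein for a cell of the form 'N (M)'."""
    proteins, domains = parse_cell(cell)
    if domains is None or proteins == 0:
        raise ValueError(f"cell {cell!r} has no usable domain count")
    return domains / proteins


# C2H2 zinc finger (PF00096) cells as printed in Table 18.
C2H2_CELLS = {"human": "564 (4500)", "fly": "234(771)", "worm": "68(155)"}

if __name__ == "__main__":
    for organism, cell in C2H2_CELLS.items():
        print(f"{organism}: {domains_per_protein(cell):.1f} domains per protein")
```

Cells without a parenthesized value are returned with a domain count of `None` rather than a guessed value, since the table does not state how a bare number should be interpreted.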

While we have illustrated expansions in a subset of signal transduction molecules in the human genome compared with the other eukaryotic genomes, it should be noted that most of the protein domains are highly conserved. An interesting observation is that worms and humans have approximately the same number of both tyrosine kinases and serine/threonine kinases (Table 19). It is important to note, however, that these are merely counts of the catalytic domain; the proteins that contain these domains also display a wide repertoire of interaction domains with significant combinatorial diversity.

Hemostasis. Hemostasis is regulated primarily by plasma proteases of the coagulation pathway and by the interactions that occur between the vascular endothelium and platelets. Consistent with known anatomical and physiological differences between vertebrates and invertebrates, extracellular adhesion domains that constitute proteins integral to hemostasis are expanded in the human relative to the fly and worm (Tables 18 and 19). We note the evolution of domains such as FIMAC, FN1, FN2, and C1q that mediate surface interactions between hematopoietic cells and the vascular matrix. In addition, there has been extensive recruitment of more-ancient animal-specific domains such as VWA, VWC, VWD, kringle, and FN3 into multidomain proteins that are involved in hemostatic regulation. Although we do not find a large expansion in the total number of serine proteases, this enzymatic domain has been specifically recruited into several of these multidomain proteins for proteolytic regulation in the vascular compartment. These are represented in plasma proteins that belong to the kinin and complement pathways. There is a significant expansion in two families of matrix metalloproteases: ADAM (a disintegrin and metalloprotease) and MMPs (matrix metalloproteases) (Table 19). Proteolysis of extracellular matrix (ECM) proteins is critical for tissue development and for tissue degradation in diseases such as cancer, arthritis, Alzheimer's disease, and a variety of inflammatory conditions (135, 136). ADAMs are a family of integral membrane proteins with a pivotal role in fibrinogenolysis and modulating interactions between hematopoietic components and the vascular matrix components. These proteins have been shown to cleave matrix proteins, and even signaling molecules: ADAM-17 converts tumor necrosis factor-α, and ADAM-10 has been implicated in the Notch signaling pathway (135). We have identified 19 members of the matrix metalloprotease family, and a total of 51 members of the ADAM and ADAM-TS families.

Apoptosis. Evolutionary conservation of some of the apoptotic pathway components across eukarya is consistent with its central role in developmental regulation and as a response to pathogens and stress signals. The signal transduction pathways involved in programmed cell death, or apoptosis, are mediated by interactions between well-characterized domains that include extracellular domains, adaptor (protein-protein interaction) domains, and those found in effector and regulatory enzymes (137). We enumerated the protein counts of central adaptor and effector enzyme domains that are found only in the apoptotic pathways to provide an estimate of divergence across eukarya and relative expansion in the human genome when compared with the fly and worm (Table 18). Adaptor domains found in proteins restricted only to apoptotic regulation such as the DED domains are vertebrate-specific, whereas others like BIR, CARD, and Bcl2 are represented in the fly and worm (although the number of Bcl2 family members in humans is significantly expanded). Although plants and yeast lack the caspases, caspase-like molecules, namely the para- and meta-caspases, have been reported in these organisms (138). Compared with other animal genomes, the human genome shows an expansion in the adaptor and effector domain-containing proteins involved in apoptosis, as well as in the proteases involved in the cascade such as the caspase and calpain families.

Expansions of other protein families.
Metabolic enzymes. There are fewer cytochrome P450 genes in humans than in either the fly or worm. Lipoxygenases (six in humans), on the other hand, appear to be specific to the vertebrates and plants, whereas the lipoxygenase-activating proteins (four in humans) may be vertebrate-specific. Lipoxygenases are involved in arachidonic acid metabolism, and they and their activators have been implicated in diverse human pathology ranging from allergic responses to cancers. One of the most surprising human expansions, however, is in the number of glyceraldehyde-3-phosphate dehydrogenase (GAPDH) genes (46 in humans, 3 in the fly, and 4 in the worm). There is, however, evidence for many retrotransposed GAPDH pseudogenes (139), which may account for this apparent expansion. However, it is interesting that GAPDH, long known as a conserved enzyme involved in basic metabolism found across all phyla from bacteria to humans, has recently been shown
to have other functions. It has a second cat- 



Table 19. Number of proteins assigned to selected Panther families or subfamilies in H. sapiens (H), D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A. thaliana (A).

Panther family/subfamily*  H  F  W  Y  A

Neural structure, function, development

Ependymin 1 0 0 0 0

Ion channels

Acetylcholine receptor 17 12 56 0 0
Amiloride-sensitive/degenerin 11 24 27 0 0
CNG/EAG 22 9 9 0 30
IRK 16 3 3 0 0
ITP/ryanodine 10 2 4 0 0
Neurotransmitter-gated 61 51 59 0 19
P2X purinoceptor 10 0 0 0 0
TASK 12 12 4S 1 5
Transient receptor 15 3 3 1 0
Voltage-gated Ca2+ alpha 22 4 8 2 2
Voltage-gated Ca2+ alpha-2 10 3 2 0 0
Voltage-gated Ca2+ beta 5 2 2 0 0
Voltage-gated Ca2+ gamma 1 0 0 0 0
Voltage-gated K+ alpha 33 5 11 0 0
Voltage-gated KQT 6 2 3 0 0
Voltage-gated Na+ 11 4 4 9 1
Myelin basic protein 1 0 0 0 0
Myelin P0 5 0 0 0 0
Myelin proteolipid 3 1 0 0 0
Myelin-oligodendrocyte glycoprotein 1 0 0 0 0
Neuropilin 2 0 0 0 0
Plexin 9 2 0 0 0
Semaphorin 22 6 2 0 0
Synaptotagmin 10 3 3 0 0

Immune response

Defensin 3 0 0 0 0
Cytokine† 86 14 1 0 0




i 

1 




0 


0 


0 


rwrcc 


1 


O 


0 


0 


0 


Intercnne alpha 


1 c 
13 


U 


0 


0 


0 


iniclCllFlc Dcta 


c 
3 




0 


0 


0 


inLcieron 


Q 
O 


ft 

u 


0 


0 


0 


11 ILCI icurvil l 




1 


1 


0 


0 


LcUKcmid inniDiiory Factor 


I 


ft 

u 


0 


0 


0 


MCSF 1 0 0 0 0
Peptidoglycan recognition protein 2 13 0 0 0
Pre-B cell enhancing factor 1 0 0 0 0
Small inducible cytokine A 14 0 0 0 0
SI cytokine 2 0 0 0 0
TNF 9 0 0 0 0
Cytokine receptor† 62 1 0 0 0
Bradykinin/C-C chemokine receptor 7 0 0 0 0
Fl cytokine receptor 2 0 0 0 0
Interferon receptor 3 0 0 0 0
Interleukin receptor 32 0 0 0 0
Leukocyte tyrosine kinase receptor 3 0 0 0 0
MCSF receptor 1 0 0 0 0
TNF receptor 3 0 0 0 0
Immunoglobulin receptor† 59 0 0 0 0
T-cell receptor alpha chain 16 0 0 0 0
T-cell receptor beta chain 15 0 0 0 0
T-cell receptor gamma chain 1 0 0 0 0
T-cell receptor delta chain 1 0 0 0 0
Immunoglobulin FC receptor 8 0 0 0 0
Killer cell receptor 16 0 0 0 0
Polymeric-immunoglobulin receptor 4 0 0 0 0
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alytic activity, as a uracil DNA glycosylase 

(140) and functions as a cell cycle regulator (141) and has even been implicated in apoptosis (142).

Translation. Another striking set of human expansions has occurred in certain families involved in the translational machinery. We identified 28 different ribosomal subunits that each have at least 10 copies in the genome; on average, for all ribosomal proteins there is about an 8- to 10-fold expansion in the number of genes relative to either the worm or fly. Retrotransposed pseudogenes

Table 19 (Continued) 




may account for many of these expansions [see the discussion above and (143)]. Recent evidence suggests that a number of ribosomal proteins have secondary functions independent of their involvement in protein biosynthesis; for example, L13a and the related L7 subunits (36 copies in humans) have been shown to induce apoptosis (144).

There is also a four- to fivefold expansion in the elongation factor 1-alpha family (eEF1A; 56 human genes). Many of these
expansions likely represent intronless para- 
logs that have presumably arisen from retro- 



Panther family/subfamily*  H  F  W  Y  A

MHC class I 22 0 0 0 0
MHC class II 20 0 0 0 0
Other immunoglobulin† 114 0 0 0 0
Toll receptor-related 10 6 0 0 0

Developmental and homeostatic regulators

Signaling molecules†
Calcitonin 3 0 0 0 0
Ephrin 8 2 4 0 0
FGF 24 1 1 0 0
Glucagon 4 0 0 0 0
Glycoprotein hormone beta chain 2 0 0 0 0
Insulin 1 0 0 0 0
Insulin-like hormone 3 0 0 0 0
Nerve growth factor 3 0 0 0 0
Neuregulin/heregulin 6 0 0 0 0
Neuropeptide Y 4 0 0 0 0
PDGF 1 1 0 0 0
Relaxin 3 0 0 0 0
Stanniocalcin 2 0 0 0 0
Thymopoietin 2 0 1 0 0
Thymosin beta 4 2 0 0 0
TGF-β 29 6 4 0 0
VEGF 4 0 0 0 0
Wnt 18 6 5 0 0
Receptors†
Ephrin receptor 12 2 1 0 0
FGF receptor 4 4 0 0 0
Frizzled receptor 12 6 5 0 0
Parathyroid hormone receptor 2 0 0 0 0
VEGF receptor 5 0 0 0 0
BDNF/NT-3 nerve growth factor receptor 4 0 0 0 0

Kinases and phosphatases

Dual-specificity protein phosphatase 29 8 10 4 11
S/T and dual-specificity protein kinase† 395 198 315 114 1102
S/T protein phosphatase 15 19 51 13 29
Y protein kinase† 106 47 100 5 16
Y protein phosphatase 56 22 95 5 6

Signal transduction

ARF family 55 29 27 12 45
Cyclic nucleotide phosphodiesterase 25 8 6 1 0
G protein-coupled receptors†‡ 616 146 284 0 1
G-protein alpha 27 10 22 2 5
G-protein beta 5 3 2 1 1
G-protein gamma 13 2 2 0 0
Ras superfamily 141 64 62 26 86
G-protein modulators†
ARF GTPase-activating 20 8 9 5 15
Neurofibromin 7 2 0 2 0
Ras GTPase-activating 9 3 8 1 0
Tuberin 7 3 2 0 0
Vav proto-oncogene family 35 15 13 3 0

transposition, and again there is evidence that 
many of these may be pseudogenes (145). 
However, a second form (eEFlA2) of this 
factor has been identied with tissue-specific 
expression in skeletal muscle and a comple- 
mentary expression pattern to the ubiquitous- 
ly expressed eEFIA (146). 

Ribonucleoproteins. Alternative splicing
results in multiple transcripts from a single
gene, and can therefore generate additional
diversity in an organism's protein comple- 
ment. We have identified 269 genes for ri- 
bonucleoproteins. This represents over 2.5 
times the number of ribonucleoprotein genes 
in the worm, two times that of the fly, and 
about the same as the 265 identified in the 
Arabidopsis genome. Whether the diversity 
of ribonucleoprotein genes in humans con- 
tributes to gene regulation at either the splic- 
ing or translational level is unknown. 
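
The ratios quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below is only illustrative: the human and Arabidopsis counts (269 and 265) come from the text, while the fly and worm counts (135 and 104) are assumed to be the F and W values of the Ribonucleoproteins row of the Panther table; the variable names are hypothetical.

```python
# Ribonucleoprotein gene counts per genome.
# Human and Arabidopsis values are stated in the text; fly and worm
# values are assumed from the Ribonucleoproteins row of the Panther table.
RNP_GENES = {"human": 269, "fly": 135, "worm": 104, "arabidopsis": 265}


def fold_over(other: str, reference: str = "human") -> float:
    """Return how many times larger the reference count is than `other`."""
    return RNP_GENES[reference] / RNP_GENES[other]


for organism in ("worm", "fly", "arabidopsis"):
    print(f"human / {organism}: {fold_over(organism):.2f}x")
```

Under these assumed counts, the human-to-worm ratio is a little above 2.5, the human-to-fly ratio is just under 2, and the human-to-Arabidopsis ratio is close to 1, which matches the wording of the paragraph.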

Posttranslational modifications. In this
set of processes, the most prominent expansion
is the transglutaminases, calcium-dependent
enzymes that catalyze the cross-linking
of proteins in cellular processes such as he- 
mostasis and apoptosis (147). The vitamin 
K-dependent gamma carboxylase gene prod- 
uct acts on the GLA domain (missing in the 
fly and worm) found in coagulation factors, 
osteocalcin, and matrix GLA protein (148). 
Tyrosylprotein sulfotransferases participate 
in the posttranslational modification of proteins
involved in inflammation and hemostasis,
including coagulation factors and chemokine
receptors (149). Although there is no
significant numerical increase in the counts 
for domains involved in nuclear protein mod- 
ification, there are a number of domain ar- 
rangements in the predicted human proteins 
that are not found in the other currently sequenced
genomes. These include the tandem
association of two histone deacetylase do- 
mains in HD6 with a ubiquitin finger domain, 
a feature lacking in the fly genome. An ad- 
ditional example is the co-occurrence of im- 
portant nuclear regulatory enzyme PARP 
(poly-ADP ribosyl transferase) domain fused 
to protein-interaction domains — BRCT and 
VWA in humans. 

Concluding remarks. There are several 
possible explanations for the differences in 
phenotypic complexity observed in humans 
when compared to the fly and worm. Some of
these relate to the prominent differences in 
the immune system, hemostasis, neuronal, 
vascular, and cytoskeletal complexity. The 
finding that the human genome contains few- 
er genes than previously predicted might be 
compensated for by combinatorial diversity 
generated at the levels of protein architecture, 
transcriptional and translational control,
posttranslational modification of proteins, or
posttranscriptional regulation. Extensive domain
shuffling to increase or alter combinatorial
diversity can provide an exponential
increase in the ability to mediate protein-
protein interactions without dramatically in- 
creasing the absolute size of the protein com- 
plement (150). Evolution of apparently new 
(from the perspective of sequence analysis) 
protein domains and increasing regulatory 
complexity by domain accretion both quanti- 
tatively and qualitatively (recruitment of nov- 
el domains with preexisting ones) are two 
features that we observe in humans. Perhaps 
the best illustration of this trend is the C2H2 
zinc finger-containing transcription factors, 
where we see expansion in the number of 
domains per protein, together with verte- 
brate-specific domains such as KRAB and 
SCAN. Recent reports on the prominent use
of internal ribosomal entry sites in the human
genome to regulate translation of specific
classes of proteins suggest that this is an area
that needs further research to identify the full
extent of this process in the human genome
(151). At the posttranslational level, although
we provide examples of expansions of some 
protein families involved in these modifica- 
tions, further experimental evidence is re- 
quired to evaluate whether this is correlated 
with increased complexity in protein process- 
ing. Posttranscriptional processing and the 
extent of isoform generation in the human 
remain to be cataloged in their entirety. Given 
the conserved nature of the spliceosomal ma- 
chinery, further analysis will be required to 
dissect regulation at this level. 

8 Conclusions 

8.1 The whole-genome sequencing 
approach versus BAC by BAC 

Experience in applying the whole-genome 
shotgun sequencing approach to a diverse 
group of organisms with a wide range of 
genome sizes and repeat content allows us to 
assess its strengths and weaknesses. With the 
success of the method for a large number of 
microbial genomes, Drosophila, and now the 
human, there can be no doubt concerning the 
utility of this method. The large number of 
microbial genomes that have been sequenced 
by this method (75, 80, 152) demonstrate that 
megabase-sized genomes can be sequenced 
efficiently without any input other than the de
novo mate-paired sequences. With more 
complex genomes like those of Drosophila or 
human, map information, in the form of well- 
ordered markers, has been critical for long- 
range ordering of scaffolds. For joining scaf- 
folds into chromosomes, the quality of the 
map (in terms of the order of the markers) is 
more important than the number of markers
per se. Although this mapping could have
been performed concurrently with sequenc- 
ing, the prior existence of mapping data was 
beneficial. During the sequencing of the A. 
thaliana genome, sequencing of individual
BAC clones permitted extension of the se- 
Panther family/subfamily*
79
28
1
1
2
8
10
0
0
19
1
0
0
5
0
1
24
1
17
3
21
1
17
2
2
24
2
16
9
1
16
1
8
168
104
74
4
78

a art d
5
0
0
0
0

                                                 H     F     W     Y     A

Bithoraxoid                                      1     8     1     0     0
Iroquois class                                   7     3     1     0     0
Distal-less                                      5     2     1     0     0
Engrailed                                        2     2     1     0     0
LIM-containing                                  17     8     3     0     0
MEIS/KNOX class                                  9     4     4     ?     ?
NK-3/NK-2 class                                  9     4     5     0     0
Paired box                                      38    28    23     0     2
Six                                              5     3     4     0     0
Leucine zipper                                   6     0     0     0     0
Nuclear hormone receptor†                       59    25   183     1     4
Pou-related                                     15     5     4     1     0
Runt-related                                     3     4     2     0     0

ECM adhesion
Cadherin                                       113    17    16     0     0
Claudin                                         20     0     0     0     0
Complement receptor-related                     22     8     6     0     0
Connexin                                        14     0     0     0     0
Galectin                                        12     5    22     0     0






£. 
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1 


0 


o 


ICAM                                             ?     ?     ?     0     0
Integrin alpha                                  24     7     ?     0     1
Integrin beta                                    9     2     2     0     0
LDL receptor family                             26    19    20     0     2
Proteoglycans                                   22     9     7     0     5

Apoptosis
Bcl-2                                           12     1     0     0     0
Calpain                                         22     4    11     1     3
Calpain inhibitor                                4     0     0     0     1
Caspase                                         13     7     3     0     0

Hemostasis
ADAM/ADAMTS                                     51     9    12     0     0
Fibronectin                                      3     0     0     0     0
Globin                                          10     2     3     0     3
Matrix metalloprotease                          19     2     ?     0     3
Serum amyloid A                                  4     0     0     0     0
Serum amyloid P (subfamily of Pentaxin)          2     0     0     0     0
Serum paraoxonase/arylesterase                   4     0     3     0     0
Serum albumin                                    4     0     0     0     0
Transglutaminase                                10     1     0     0     0

Other enzymes
Cytochrome p450                                 60    89    83     3   256
GAPDH                                           46     3     4     3     8
Heparan sulfotransferase                        11     4     2     0     0

Splicing and translation
EF-1alpha                                       56    13    10     6    13
Ribonucleoproteins†                            269   135   104    60   265
Ribosomal proteins†                            812   111    80   117   256

*The table lists Panther families or subfamilies relevant to the text that either (i) are not specifically represented by Pfam
(Table 18) or (ii) differ in counts from the corresponding Pfam models. †This class represents a number of different
families in the same Panther molecular function subcategory. ‡This count includes only rhodopsin-class, secretin-
class, and metabotropic glutamate-class GPCRs.






quence well into centromeric regions and allowed
high-quality resolution of complex repeat
regions. Likewise, in Drosophila, the
BAC physical map was most useful in re- 
gions near the highly repetitive centromeres 
and telomeres. WGA has been found to de- 
liver excellent-quality reconstructions of the 
unique regions of the genome. As the genome 
size, and more importantly the repetitive con- 
tent, increases, the WGA approach delivers 
less of the repetitive sequence. 

The cost and overall efficiency of clone-by-
clone approaches make them difficult to justify
as a stand-alone strategy for future large-scale 
genome-sequencing projects. Specific applica- 
tions of BAC-based or other clone mapping and 
sequencing strategies to resolve ambiguities in 
sequence assembly that cannot be efficiently 
resolved with computational approaches alone 
are clearly worth exploring. Hybrid approaches 
to whole-genome sequencing will only work if 
there is sufficient coverage in both the whole- 
genome shotgun phase and the BAC clone sequencing
phase. Our experience with human
genome assembly suggests that this will require
at least 3X coverage of both whole-genome and
BAC shotgun sequence data. 

8.2 The low gene number in humans 

We have sequenced and assembled ~95% of
the euchromatic sequence of H. sapiens and
used a new automated gene prediction meth- 
od to produce a preliminary catalog of the 
human genes. This has provided a major surprise:
We have found far fewer genes (26,000
to 38,000) than the earlier molecular predictions
(50,000 to over 140,000). Whatever
the reasons for this current disparity, only 
detailed annotation, comparative genomics 
(particularly using the Mus musculus ge- 
nome), and careful molecular dissection of 
complex phenotypes will clarify this critical 
issue of the basic "parts list" of our genome. 
Certainly, the analysis is still incomplete and 
considerable refinement will occur in the
years to come as the precise structure of each 
transcription unit is evaluated. A good place 
to start is to determine why the gene esti- 
mates derived from EST data are so discor- 
dant with our predictions. It is likely that the 
following contribute to an inflated gene num- 
ber derived from ESTs: the variable lengths 
of 3'- and 5'-untranslated leaders and trailers;
the little-understood vagaries of RNA pro- 
cessing that often leave intronic regions in an 
unspliced condition; the finding that nearly 
40% of human genes are alternatively spliced 
(153); and finally, the unsolved technical 
problems in EST library construction where 
contamination from heterogeneous nuclear 
RNA and genomic DNA are not uncommon. 
Of course, it is possible that there are genes
that remain unpredicted owing to the absence 
of EST or protein data to support them, al- 
though our use of mouse genome data for 



predicting genes should limit this number. As 
was true at the beginning of genome sequencing,
ultimately it will be necessary to measure
mRNA in specific cell types to demonstrate 
the presence of a gene. 

J. B. S. Haldane speculated in 1937 that a
population of organisms might have to pay a 
price for the number of genes it can possibly 
carry. He theorized that when the number of 
genes becomes too large, each zygote carries 
so many new deleterious mutations that the 
population simply cannot maintain itself. On
the basis of this premise, and on the basis of 
available mutation rates and x-ray-induced 
mutations at specific loci, Muller, in 1967 
(154), calculated that the mammalian ge- 
nome would contain a maximum of not much 
more than 30,000 genes (155). An estimate of
30,000 gene loci for humans was also arrived
at by Crow and Kimura (156). Muller's estimate
for D. melanogaster was 10,000 genes,
compared to 13,000 derived by annotation of
the fly genome (26, 27). These arguments for
the theoretical maximum gene number were
based on simplified ideas of genetic load — 
that all genes have a certain low rate of 
mutation to a deleterious state. However, it is 
clear that many mouse, fly, worm, and yeast 
knockout mutations lead to almost no dis- 
cernible phenotypic perturbations. 
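
The genetic-load reasoning in this paragraph can be illustrated numerically. The block below is a minimal sketch under stated assumptions, not a reconstruction of Haldane's or Muller's actual calculation: it assumes every gene mutates to a deleterious state at the same per-generation rate, that a diploid zygote carries two copies of each gene, and that new mutations per zygote are Poisson distributed. The rate `HYPOTHETICAL_RATE` and the function names are invented for illustration and do not come from the text.

```python
import math

# Assumed per-gene, per-generation rate of mutation to a deleterious state.
# This value is hypothetical and is not taken from the text.
HYPOTHETICAL_RATE = 1e-5


def new_mutations_per_zygote(n_genes: int, rate_per_gene: float) -> float:
    """Expected number of new deleterious mutations in a diploid zygote."""
    return 2 * n_genes * rate_per_gene


def fraction_mutation_free(n_genes: int, rate_per_gene: float) -> float:
    """Poisson probability that a zygote carries no new deleterious mutation."""
    return math.exp(-new_mutations_per_zygote(n_genes, rate_per_gene))


# Compare the low gene count discussed in the text with the highest
# earlier estimate quoted in section 8.2.
for n_genes in (30_000, 140_000):
    share = fraction_mutation_free(n_genes, HYPOTHETICAL_RATE)
    print(f"{n_genes:>7} genes: {share:.3f} of zygotes free of new mutations")
```

With this assumed rate, roughly half of zygotes are free of new deleterious mutations at 30,000 genes, but only about 6% at 140,000 genes, which is the kind of pressure the argument relies on. The result is sensitive to the assumed rate.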

The modest number of human genes 
means that we must look elsewhere for the 
mechanisms that generate the complexities 
inherent in human development and the so- 
phisticated signaling systems that maintain 
homeostasis. There are a large number of 
ways in which the functions of individual 
genes and gene products are regulated. The 
degree of "openness" of chromatin structure 
and hence transcriptional activity is regulated 
by protein complexes that involve histone 
and DNA enzymatic modifications. We enu- 
merate many of the proteins that are likely 
involved in nuclear regulation in Table 19. 
The location, timing, and quantity of tran- 
scription are intimately linked to nuclear sig- 
nal transduction events as well as by the 
tissue-specific expression of many of these 
proteins. Equally important are regulatory 
DNA elements that include insulators, re- 
peats, and endogenous viruses (157); meth- 
ylation of CpG islands in imprinting (158); 
and promoter-enhancer and intronic regions 
that modulate transcription. The spliceosomal 
machinery consists of multisubunit proteins 
(Table 19) as well as structural and catalytic 
RNA elements (159) that regulate transcript 
structure through alternative start and termi- 
nation sites and splicing. Hence, there is a 
need to study different classes of RNA mol- 
ecules (160) such as small nucleolar RNAs, 
antisense riboregulator RNA, RNA involved 
in X-dosage compensation, and other struc- 
tural RNAs to appreciate their precise role in 
regulating gene expression. The phenomenon 



of RNA editing in which coding changes 
occur directly at the level of mRNA is of 
clinical and biological relevance (161). Finally,
examples of translational control include
internal ribosomal entry sites that are found 
in proteins involved in cell cycle regulation 
and apoptosis (162). At the protein level, 
minor alterations in the nature of protein- 
protein interactions, protein modifications, 
and localization can have dramatic effects on 
cellular physiology (163). This dynamic sys- 
tem therefore has many ways to modulate 
activity, which suggests that definition of 
complex systems by analysis of single genes 
is unlikely to be entirely successful. 

In situ studies have shown that the human 
genome is asymmetrically populated with 
G+C content, CpG islands, and genes (68). 
However, the genes are not distributed quite 
as unequally as had been predicted (Table 9) 
(69). The most G+C-rich fraction of the ge- 
nome, H3 isochores, constitute more of the 
genome than previously thought (about 9%), 
and are the most gene-dense fraction, but
contain only 25% of the genes, rather than the
predicted ~40%. The low G+C L isochores
make up 65% of the genome, and 48% of the 
genes. This inhomogeneity, the net result of 
millions of years of mammalian gene dupli- 
cation, has been described as the "desertifi- 
cation" of the vertebrate genome (71). Why 
are there clustered regions of high and low 
gene density, and are these accidents of history
or driven by selection and evolution? If
these deserts are dispensable, it ought to be 
possible to find mammalian genomes that are 
far smaller in size than the human genome. 
Indeed, many species of bats have genome 
sizes that are much smaller than that of hu- 
mans; for example, Miniopterus, a species of 
Italian bat, has a genome size that is only 
50% that of humans (164). Similarly, Mun- 
tiacus, a species of Asian barking deer, has a 
genome size that is ~70% that of humans.
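
The isochore percentages given above translate directly into relative gene densities. The following sketch uses only the figures quoted in this paragraph (H3: about 9% of the genome and 25% of the genes; L: 65% of the genome and 48% of the genes). The "remainder" class is derived here on the assumption that the percentages sum to 100, and the names are hypothetical.

```python
# (percent of genome, percent of genes) for each isochore class,
# as quoted in the paragraph above.
ISOCHORES = {"H3": (9.0, 25.0), "L": (65.0, 48.0)}


def relative_gene_density(genome_pct: float, gene_pct: float) -> float:
    """Gene density relative to the genome-wide average (1.0 = average)."""
    return gene_pct / genome_pct


def remainder_class(classes: dict) -> tuple:
    """Genome and gene percentages left over, assuming totals of 100%."""
    genome_left = 100.0 - sum(genome for genome, _ in classes.values())
    genes_left = 100.0 - sum(genes for _, genes in classes.values())
    return genome_left, genes_left


for name, (genome_pct, gene_pct) in ISOCHORES.items():
    print(name, round(relative_gene_density(genome_pct, gene_pct), 2))
print("remainder", round(relative_gene_density(*remainder_class(ISOCHORES)), 2))
```

On these numbers, H3 isochores carry genes at nearly three times the average density, while the L isochores sit at about three-quarters of the average.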

8.3 Human DNA sequence variation 
and its distribution across the genome 

This is the first eukaryotic genome in which a 
nearly uniform ascertainment of polymorphism 
has been completed. Although we have identified
and mapped more than 3 million SNPs, this
by no means implies that the task of finding and 
cataloging SNPs is complete. These represent 
only a fraction of the SNPs present in the 
human population as a whole. Nevertheless, 
this first glimpse at genome-wide variation has 
revealed strong inhomogeneities in the distribu- 
tion of SNPs across the genome. Polymorphism 
in DNA carries with it a snapshot of the past 
operation of population genetic forces, includ- 
ing mutation, migration, selection, and genetic 
drift. The availability of a dense array of SNPs 
will allow questions related to each of these 
factors to be addressed on a genome-wide basis. 
SNP studies can establish the range of haplotypes
present in subjects of different ethnogeographic
origins, providing insights into population
history and migration patterns. Although
such studies have suggested that modern human 
lineages derive from Africa, many important 
questions regarding human origins remain un- 
answered, and more analyses using detailed 
SNP maps will be needed to settle these con- 
troversies. In addition to providing evidence for 
population expansions, migration, and admix- 
ture, SNPs can serve as markers for the extent 
of evolutionary constraint acting on particular 
genes. The correlation between patterns of in- 
traspecies and interspecies genetic variation 
may prove to be especially informative to iden- 
tify sites of reduced genetic diversity that may 
mark loci where sequence variations are not 
tolerated.

The remarkable heterogeneity in SNP 
density implies that there are a variety of 
forces acting on polymorphism — sparse re- 
gions may have lower SNP density because 
the mutation rate is lower, because most of 
those regions have a lower fraction of muta- 
tions that are tolerated, or because recent 
strong selection in favor of a newly arisen 
allele "swept" the linked variation out of the 
population (165). The effect of random ge- 
netic drift also varies widely across the ge- 
nome. The nonrecombining portion of the Y 
chromosome faces the strongest pressure 
from random drift because there are roughly 
one-quarter as many Y chromosomes in the 
population as there are autosomal chromo- 
somes, and the level of polymorphism on the 
Y is correspondingly less. Similarly, the X 
chromosome has a smaller effective popu- 
lation size than the autosomes, and its nu- 
cleotide diversity is also reduced. But even 
across a single autosome, the effective pop- 
ulation size can vary because the density of 
deleterious mutations may vary. Regions of 
high density of deleterious mutations will 
see a greater rate of elimination by selection,
and the effective population size will
be smaller (166). As a result, the density of
even completely neutral SNPs will be lower 
in such regions. There is a large literature 
on the association between SNP density 
and local recombination rates in Drosoph- 
ila, and it remains an important task to 
assess the strength of this association in the 
human genome, because of its impact on 
the design of local SNP densities for disease-association
studies. It also remains an
important task to validate SNPs on a
genomic scale in order to assess the degree
of heterogeneity among geographic and
ethnic populations. 
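
The "one-quarter" figure for the Y chromosome, and the smaller effective population size of the X, follow from counting chromosome copies. The sketch below assumes a simple population with a given number of breeding males and females and no other demographic effects; the function names and example population sizes are hypothetical.

```python
from fractions import Fraction


def relative_chromosome_copies(males: int = 1, females: int = 1) -> dict:
    """Copies of each chromosome class, relative to one autosome.

    Each individual carries two copies of every autosome; females carry
    two X chromosomes, males carry one X and one Y.
    """
    autosome = 2 * (males + females)
    x_copies = 2 * females + males
    y_copies = males
    return {
        "autosome": Fraction(1),
        "X": Fraction(x_copies, autosome),
        "Y": Fraction(y_copies, autosome),
    }


def relative_neutral_diversity(copy_ratio: Fraction) -> Fraction:
    """Under neutrality and equal mutation rates, expected diversity
    scales with the number of copies, so the ratio carries over."""
    return copy_ratio


ratios = relative_chromosome_copies(males=1, females=1)
for name, ratio in ratios.items():
    print(name, ratio, relative_neutral_diversity(ratio))
```

With equal numbers of males and females, the X is present at three-quarters and the Y at one-quarter of the autosomal copy number. This is a simplification: selection, unequal mutation rates between the sexes, and skewed breeding ratios would all shift the observed values.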

8.4 Genome complexity 

We will soon be in a position to move away 
from the cataloging of individual components
of the system, and beyond the simplistic
notions of "this binds to that, which
then docks on this, and then the complex
moves there . . ." (167) to the exciting area
of network perturbations, nonlinear re- 
sponses and thresholds, and their pivotal 
role in human diseases. 

The enumeration of other "parts lists" re- 
veals that in organisms with complex nervous 
systems, neither gene number, neuron number, 
nor number of cell types correlates in any 
meaningful manner with even simplistic mea- 
sures of structural or behavioral complexity. 
Nor would they be expected to; this is the realm 
of nonlinearities and epigenesis (168). The 520 
million neurons of the common octopus exceed 
the neuronal number in the brain of a mouse by 
an order of magnitude. It is apparent from a 
comparison of genomic data on the mouse and 
human, and from comparative mammalian neuroanatomy
(169), that the morphological and
behavioral diversity found in mammals is un- 
derpinned by a similar gene repertoire and sim- 
ilar neuroanatomies. For example, when one 
compares a pygmy marmoset (which is only 4 
inches tall and weighs about 6 ounces) to a 
chimpanzee, the brain volume of this minute 
primate is found to be only about 1.5 cm³, two
orders of magnitude less than that of a chimp 
and three orders less than that of humans. Yet 
the neuroanatomies of all three brains are strik- 
ingly similar, and the behavioral characteristics 
of the pygmy marmoset are little different from 
those of chimpanzees. Between humans and 
chimpanzees, the gene number, gene structures 
and functions, chromosomal and genomic or- 
ganizations, and cell types and neuroanatomies 
are almost indistinguishable, yet the developmental
modifications that predisposed human
lineages to cortical expansion and development
of the larynx, giving rise to language, culminated
in a massive singularity that by even the
simplest of criteria made humans more com- 
plex in a behavioral sense. 

Simple examination of the number of neu- 
rons, cell types, or genes or of the genome 
size does not alone account for the differenc- 
es in complexity that we observe. Rather, it is 
the interactions within and among these sets 
that result in such great variation. In addition, 
it is possible that there are "special cases" of
regulatory gene networks that have a dispro- 
portionate effect on the overall system. We 
have presented several examples of "regula- 
tory genes" that are significantly increased in 
the human genome compared with the fly and 
worm. These include extracellular ligands 
and their cognate receptors (e.g., wnt, frizzled,
TGF-β, ephrins, and connexins), as well
as nuclear regulators (e.g., the KRAB and 
homeodomain transcription factor families), 
where a few proteins control broad develop- 
mental processes. The answers to these 
"complexities" perhaps lie in these expanded 
gene families and differences in the regulato- 
ry control of ancient genes, proteins, path- 
ways, and cells. 



8.5 Beyond single components 

While few would disagree with the intuitive 
conclusion that Einstein's brain was more 
complex than that of Drosophila, closer com- 
parisons such as whether the set of predicted 
human proteins is more complex than the 
protein set of Drosophila, and if so, to what 
degree, are not straightforward, since protein, 
protein domain, or protein-protein interaction 
measures do not capture context-dependent 
interactions that underpin the dynamics un- 
derlying phenotype. 

Currently, there are more than 30 different 
mathematical descriptions of complexity (170). 
However, we have yet to understand the math- 
ematical dependency relating the number of 
genes with organism complexity. One pragmat- 
ic approach to the analysis of biological sys- 
tems, which are composed of nonidentical ele- 
ments (proteins, protein complexes, interacting 
cell types, and interacting neuronal populations),
is through graph theory (171). The elements
of the system can be represented by the
vertices of complex topographies, with the edg- 
es representing the interactions between them. 
Examination of large networks reveals that they 
can self-organize, but more important, they can 
be particularly robust. This robustness is not
due to redundancy, but is a property of inho- 
mogeneously wired networks. The error toler- 
ance of such networks comes with a price; they 
are vulnerable to the selection or removal of a 
few nodes that contribute disproportionately to 
network stability. Gene knockouts provide an 
illustration. Some knockouts may have minor 
effects, whereas others have catastrophic effects 
on the system. In the case of vimentin, a sup- 
posedly critical component of the cytoplasmic 
intermediate filament network of mammals, the 
knockout of the gene in mice reveals them to be 
reproductively normal, with no obvious phenotypic
effects (172), and yet the usually conspicuous
vimentin network is completely absent.
On the other hand, ~30% of knockouts in
Drosophila and mice correspond to critical 
nodes whose reduction in gene product, or total 
elimination, causes the network to crash most 
of the time, although even in some of these 
cases, phenotypic normalcy ensues, given the 
appropriate genetic background. Thus, there are 
no "good" genes or "bad" genes, but only net- 
works that exist at various levels and at differ- 
ent connectivities, and at different states of 
sensitivity to perturbation. Sophisticated math- 
ematical analysis needs to be constantly evalu- 
ated against hard biological data sets that spe- 
cifically address network dynamics. Nowhere is 
this more critical than in attempts to come to 
grips with "complexity," particularly because 
deconvoluting and correcting complex net- 
works that have undergone perturbation, and 
have resulted in human diseases, is the greatest 
significant challenge now facing us.
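
The contrast drawn above, between tolerance of most knockouts and vulnerability to loss of a few highly connected nodes, can be sketched with a small simulation. The block below is a minimal illustration under stated assumptions: it uses a preferential-attachment growth model as a stand-in for an "inhomogeneously wired" network, and the model choice, function names, and parameters (400 nodes, 2 links per new node, 10% of nodes removed) are assumptions made here, not details from the text.

```python
import random
from collections import defaultdict


def preferential_attachment_graph(n, m, rng):
    """Grow a graph in which each new node links to m existing nodes,
    chosen with probability proportional to their current degree."""
    adj = defaultdict(set)
    # Seed with a fully connected core of m + 1 nodes.
    for i in range(m + 1):
        for j in range(i):
            adj[i].add(j)
            adj[j].add(i)
    # A node appears in this list once per edge it participates in.
    endpoints = [v for v in adj for _ in adj[v]]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            endpoints.extend((new, t))
    return dict(adj)


def largest_component(adj, removed):
    """Size of the largest connected component after deleting `removed`."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best


def knockout_experiment(n=400, m=2, fraction=0.1, seed=0):
    """Compare random node removal with removal of the best-connected nodes."""
    rng = random.Random(seed)
    adj = preferential_attachment_graph(n, m, rng)
    k = int(n * fraction)
    random_nodes = rng.sample(sorted(adj), k)
    hubs = sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:k]
    return {
        "intact": largest_component(adj, ()),
        "random_knockouts": largest_component(adj, random_nodes),
        "hub_knockouts": largest_component(adj, hubs),
    }


print(knockout_experiment())
```

In this toy model, removing randomly chosen nodes leaves most of the network connected, whereas removing the same number of hubs fragments it far more. It is only an analogy for the gene-knockout observations; real interaction networks are not guaranteed to follow this growth model.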

It has been predicted for the last 15 years 
that complete sequencing of the human genome
would open up new strategies for human
biological research and would have a
major impact on medicine, and through medicine
and public health, on society. Effects on
biomedical research are already being felt.
This assembly of the human genome se- 
quence is but a first, hesitant step on a long 
and exciting journey toward understanding 
the role of the genome in human biology. It 
has been possible only because of innova- 
tions in instrumentation and software that 
have allowed automation of almost every step 
of the process from DNA preparation to an- 
notation. The next steps are clear: We must 
define the complexity that ensues when this 
relatively modest set of about 30,000 genes is 
expressed. The sequence provides the frame- 
work upon which all the genetics, biochem- 
istry, physiology, and ultimately phenotype 
depend. It provides the boundaries for scien- 
tific inquiry. The sequence is only the first 
level of understanding of the genome. All 
genes and their control elements must be 
identified; their functions, in concert as well 
as in isolation, defined; their sequence varia- 
tion worldwide described; and the relation 
between genome variation and specific phe- 
notypic characteristics determined. Now we 
know what we have to explain. 

Another paramount challenge awaits: 
public discussion of this information and its 
potential for improvement of personal health. 
Many diverse sources of data have shown 
that any two individuals are more than 99.9% 
identical in sequence, which means that all 
the glorious differences among individuals in 
our species that can be attributed to genes 
fall in a mere 0.1% of the sequence. There
are two fallacies to be avoided: determinism, 
the idea that all characteristics of the person 
are "hard-wired" by the genome; and reductionism,
the view that with complete knowledge
of the human genome sequence, it is
only a matter of time before our understand- 
ing of gene functions and interactions will 
provide a complete causal description of hu- 
man variability. The real challenge of human 
biology, beyond the task of finding out how 
genes orchestrate the construction and main- 
tenance of the miraculous mechanism of our 
bodies, will lie ahead as we seek to explain 
how our minds have come to organize 
thoughts sufficiently well to investigate our 
own existence. 